* Long delay on estimation_timer causes packet latency
@ 2020-04-16 22:00 ` yunhong-cgl jiang
  0 siblings, 0 replies; 10+ messages in thread
From: yunhong-cgl jiang @ 2020-04-16 22:00 UTC (permalink / raw)
  To: horms, ja; +Cc: netdev, lvs-devel, Yunhong Jiang

Hi, Simon & Julian,
	We noticed that on our Kubernetes node utilizing IPVS, estimation_timer() takes very long (>200ms, as shown below). Such a long delay in the timer softirq causes long packet latency.

          <idle>-0     [007] dNH. 25652945.670814: softirq_raise: vec=1 [action=TIMER]
.....
          <idle>-0     [007] .Ns. 25652945.992273: softirq_exit: vec=1 [action=TIMER]

	The long latency is caused by the large number of services (>50k) and the large number of CPUs (>80).

	We tried moving the timer function into a kernel thread so that it does not block the system, and this seems to solve our problem. Is this the right direction? If yes, we will do more testing and send out an RFC patch. If not, can you give us some suggestions?

Thanks
—yunhong 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Long delay on estimation_timer causes packet latency
  2020-04-16 22:00 ` yunhong-cgl jiang
  (?)
@ 2020-04-17  7:47 ` Julian Anastasov
  2020-04-17 16:56     ` yunhong-cgl jiang
  -1 siblings, 1 reply; 10+ messages in thread
From: Julian Anastasov @ 2020-04-17  7:47 UTC (permalink / raw)
  To: yunhong-cgl jiang; +Cc: horms, netdev, lvs-devel, Yunhong Jiang


	Hello,

On Thu, 16 Apr 2020, yunhong-cgl jiang wrote:

> Hi, Simon & Julian,
> 	We noticed that on our kubernetes node utilizing IPVS, the estimation_timer() takes very long (>200sm as shown below). Such long delay on timer softirq causes long packet latency.  
> 
>           <idle>-0     [007] dNH. 25652945.670814: softirq_raise: vec=1 [action=TIMER]
> .....
>           <idle>-0     [007] .Ns. 25652945.992273: softirq_exit: vec=1 [action=TIMER]
> 
> 	The long latency is caused by the big service number (>50k) and large CPU number (>80 CPUs),
> 
> 	We tried to move the timer function into a kernel thread so that it will not block the system and seems solves our problem. Is this the right direction? If yes, we will do more testing and send out the RFC patch. If not, can you give us some suggestion?

	Using a kernel thread is a good idea. For this to work, we can
also remove the est_lock and use RCU for est_list.
The writers ip_vs_start_estimator() and ip_vs_stop_estimator() already
run under the common mutex __ip_vs_mutex, so they do not need any
extra synchronization. We need _bh lock usage in estimation_timer().
Let me know if you need any help with the patch.
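
	A rough sketch of that direction (illustrative only, not a
tested patch; the kthread setup below assumes the existing est_list
and per-stats lock fields) could look like this:

	static int ip_vs_est_kthread(void *data)
	{
		struct netns_ipvs *ipvs = data;
		struct ip_vs_estimator *e;

		while (!kthread_should_stop()) {
			rcu_read_lock();
			list_for_each_entry_rcu(e, &ipvs->est_list, list) {
				struct ip_vs_stats *s =
					container_of(e, struct ip_vs_stats, est);

				spin_lock_bh(&s->lock);
				/* update the 2-second rate estimates from
				 * the counters in s, as the timer does today
				 */
				spin_unlock_bh(&s->lock);
			}
			rcu_read_unlock();

			/* keep the 2-second estimation period */
			schedule_timeout_interruptible(2 * HZ);
		}
		return 0;
	}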

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Long delay on estimation_timer causes packet latency
  2020-04-17  7:47 ` Julian Anastasov
@ 2020-04-17 16:56     ` yunhong-cgl jiang
  0 siblings, 0 replies; 10+ messages in thread
From: yunhong-cgl jiang @ 2020-04-17 16:56 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: horms, netdev, lvs-devel, Yunhong Jiang

Thanks for the reply.

Yes, our patch changes the est_list to an RCU list. We will do more testing and send out the patch.

Thanks
—Yunhong


> On Apr 17, 2020, at 12:47 AM, Julian Anastasov <ja@ssi.bg> wrote:
> 
> 
> 	Hello,
> 
> On Thu, 16 Apr 2020, yunhong-cgl jiang wrote:
> 
>> Hi, Simon & Julian,
>> 	We noticed that on our kubernetes node utilizing IPVS, the estimation_timer() takes very long (>200sm as shown below). Such long delay on timer softirq causes long packet latency.  
>> 
>>          <idle>-0     [007] dNH. 25652945.670814: softirq_raise: vec=1 [action=TIMER]
>> .....
>>          <idle>-0     [007] .Ns. 25652945.992273: softirq_exit: vec=1 [action=TIMER]
>> 
>> 	The long latency is caused by the big service number (>50k) and large CPU number (>80 CPUs),
>> 
>> 	We tried to move the timer function into a kernel thread so that it will not block the system and seems solves our problem. Is this the right direction? If yes, we will do more testing and send out the RFC patch. If not, can you give us some suggestion?
> 
> 	Using kernel thread is a good idea. For this to work, we can
> also remove the est_lock and to use RCU for est_list.
> The writers ip_vs_start_estimator() and ip_vs_stop_estimator() already
> run under common mutex __ip_vs_mutex, so they not need any
> synchronization. We need _bh lock usage in estimation_timer().
> Let me know if you need any help with the patch.
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Long delay on estimation_timer causes packet latency
  2020-04-17 16:56     ` yunhong-cgl jiang
  (?)
@ 2020-12-03  6:42     ` dust.li
  2020-12-03  8:48       ` Julian Anastasov
  -1 siblings, 1 reply; 10+ messages in thread
From: dust.li @ 2020-12-03  6:42 UTC (permalink / raw)
  To: yunhong-cgl jiang, Julian Anastasov
  Cc: horms, netdev, lvs-devel, Yunhong Jiang

Hi Yunhong & Julian, any updates?

We've encountered the same problem. With lots of IPVS services plus
many CPUs, it's easy to reproduce this issue. I have a simple script
to reproduce it.

First, add many ipvs services:

for((i=0;i<50000;i++)); do
         ipvsadm -A -t 10.10.10.10:$((2000+$i))
done


Then, check the latency of estimation_timer() using bpftrace:

#!/usr/bin/bpftrace

kprobe:estimation_timer {
         @enter = nsecs;
}

kretprobe:estimation_timer {
         $exit = nsecs;
         printf("latency: %ld us\n", ($exit - @enter)/1000);
}

I observed about a 268ms delay on my 104-CPU test server.

Attaching 2 probes...
latency: 268807 us
latency: 268519 us
latency: 269263 us


And I tried moving estimation_timer() into a delayed workqueue;
this does make things better. But since the estimation won't give
up the CPU, it can run for a pretty long time without scheduling on
a server that doesn't have preemption enabled, so tasks on that CPU
can't get executed during that period.


Since the estimation repeats every 2s, we can't call cond_resched()
to give up the CPU in the middle of iterating the est_list, or the
estimation will be quite inaccurate. Besides, the est_list needs to
be protected.


I haven't found any ideal solution yet. Currently, we just moved the
estimation into a kworker and added a sysctl to allow us to disable
the estimation, since we don't need the estimation anyway.

Our patches are pretty simple now; if you think they're useful, I can
paste them.
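
In rough form, the core of it looks like this (a sketch only; the
work item, flag and helper names below are placeholders, not our
actual patch):

static void ip_vs_est_workfn(struct work_struct *work)
{
        /* est_dwork and sysctl_run_estimation are assumed fields in
         * struct netns_ipvs for this sketch
         */
        struct netns_ipvs *ipvs = container_of(to_delayed_work(work),
                                               struct netns_ipvs,
                                               est_dwork);
        struct ip_vs_estimator *e;

        if (ipvs->sysctl_run_estimation) {
                spin_lock_bh(&ipvs->est_lock);
                list_for_each_entry(e, &ipvs->est_list, list) {
                        /* same per-entry rate estimation that
                         * estimation_timer() does today
                         */
                }
                spin_unlock_bh(&ipvs->est_lock);
        }

        /* re-arm every 2 seconds, like the old timer */
        queue_delayed_work(system_long_wq, &ipvs->est_dwork, 2 * HZ);
}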


Do you guys have any suggestions or solutions?

Thanks a lot!

Dust



On 4/18/20 12:56 AM, yunhong-cgl jiang wrote:
> Thanks for reply.
>
> Yes, our patch changes the est_list to a RCU list. Will do more testing and send out the patch.
>
> Thanks
> —Yunhong
>
>
>> On Apr 17, 2020, at 12:47 AM, Julian Anastasov <ja@ssi.bg> wrote:
>>
>>
>> 	Hello,
>>
>> On Thu, 16 Apr 2020, yunhong-cgl jiang wrote:
>>
>>> Hi, Simon & Julian,
>>> 	We noticed that on our kubernetes node utilizing IPVS, the estimation_timer() takes very long (>200sm as shown below). Such long delay on timer softirq causes long packet latency.
>>>
>>>           <idle>-0     [007] dNH. 25652945.670814: softirq_raise: vec=1 [action=TIMER]
>>> .....
>>>           <idle>-0     [007] .Ns. 25652945.992273: softirq_exit: vec=1 [action=TIMER]
>>>
>>> 	The long latency is caused by the big service number (>50k) and large CPU number (>80 CPUs),
>>>
>>> 	We tried to move the timer function into a kernel thread so that it will not block the system and seems solves our problem. Is this the right direction? If yes, we will do more testing and send out the RFC patch. If not, can you give us some suggestion?
>> 	Using kernel thread is a good idea. For this to work, we can
>> also remove the est_lock and to use RCU for est_list.
>> The writers ip_vs_start_estimator() and ip_vs_stop_estimator() already
>> run under common mutex __ip_vs_mutex, so they not need any
>> synchronization. We need _bh lock usage in estimation_timer().
>> Let me know if you need any help with the patch.
>>
>> Regards
>>
>> --
>> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Long delay on estimation_timer causes packet latency
  2020-12-03  6:42     ` dust.li
@ 2020-12-03  8:48       ` Julian Anastasov
  2020-12-04  3:27         ` dust.li
  0 siblings, 1 reply; 10+ messages in thread
From: Julian Anastasov @ 2020-12-03  8:48 UTC (permalink / raw)
  To: dust.li; +Cc: yunhong-cgl jiang, horms, netdev, lvs-devel, Yunhong Jiang


	Hello,

On Thu, 3 Dec 2020, dust.li wrote:

> Hi Yunhong & Julian, any updates ?
> 
> 
> We've encountered the same problem. With lots of ipvs
> 
> services plus many CPUs, it's easy to reproduce this issue.
> 
> I have a simple script to reproduce:
> 
> First add many ipvs services:
> 
> for((i=0;i<50000;i++)); do
>         ipvsadm -A -t 10.10.10.10:$((2000+$i))
> done
> 
> 
> Then, check the latency of estimation_timer() using bpftrace:
> 
> #!/usr/bin/bpftrace
> 
> kprobe:estimation_timer {
>         @enter = nsecs;
> }
> 
> kretprobe:estimation_timer {
>         $exit = nsecs;
>         printf("latency: %ld us\n", (nsecs - @enter)/1000);
> }
> 
> I observed about 268ms delay on my 104 CPUs test server.
> 
> Attaching 2 probes...
> latency: 268807 us
> latency: 268519 us
> latency: 269263 us
> 
> 
> And I tried moving estimation_timer() into a delayed
> 
> workqueue, this do make things better. But since the
> 
> estimation won't give up CPU, it can run for pretty
> 
> long without scheduling on a server which don't have
> 
> preempt enabled, so tasks on that CPU can't get executed
> 
> during that period.
> 
> 
> Since the estimation repeated every 2s, we can't call
> 
> cond_resched() to give up CPU in the middle of iterating the
> 
> est_list, or the estimation will be quite inaccurate.
> 
> Besides the est_list needs to be protected.
> 
> 
> I haven't found any ideal solution yet, currently, we just
> 
> moved the estimation into kworker and add sysctl to allow
> 
> us to disable the estimation, since we don't need the
> 
> estimation anyway.
> 
> 
> Our patches is pretty simple now, if you think it's useful,
> 
> I can paste them
> 
> 
> Do you guys have any suggestions or solutions ?

	When I saw the report the first time, I thought about this
issue, and here is what I think:

- use a single delayed work (many would be slow to stop)

- the delayed work walks, say, 64 lists with estimators and
reschedules itself for the next 30ms; as a result, 30ms*64=1920ms,
and 80ms will be idle up to the 2000ms estimation period
for every list. As a result, every list is walked once per 2000ms.
If 30ms is odd for some HZ values, this can be
20ms * 100.

- the work will use spin_lock_bh(&s->lock) to protect the
entries; we do not want delays between /proc readers and
the work if using a mutex. But _bh locks stop timers and
networking for a short time :( Not sure yet if just spin_lock
is safe for both /proc and the estimator's work.

- while walking one of the 64 lists we should use just
RCU read locking for the current list, with no cond_resched_rcu,
because we do not know how to synchronize while entries are
removed at the same time. For now, using an array of 64 lists
solves this problem.

- the algorithm will look like this:

	int row = 0;

	for () {
		rcu_read_lock();
		list_for_each_entry_rcu(e, &ipvs->est_list[row], list) {

		...

			/* Should we stop immediately ? */
			if (!ipvs->enable || stopping delayed work) {
				rcu_read_unlock();
				goto out;
			}
		}
		/* rcu_read_unlock will reschedule if needed
		 * but we know where to start from the next time,
		 * i.e. from next slot
		 */
		rcu_read_unlock();
		reschedule delayed work for +30ms or +110ms if last row
		by using queue_delayed_work*()
		row ++;
		if (row >= 64)
			row = 0;
	}

out:

- the drawback is that without cond_resched_rcu we are not
fair to other processes; a solution is needed, but we reduce
the problem by using 64 lists, which can become imbalanced.

- next goal: entries should be removed with list_del_rcu,
without any locks; we do not want to delay configuration, so
ipvs->est_lock should be removed.

- now the problem is how to spread the entries over the 64 lists.
One way is randomly; in this case the first estimation may cover
less than 2000ms. In this case, fewer entries means
less work for all 64 steps. But what if entries are removed
and some slots become empty while others grow too large?

- if we want to make the work equal for all 64 passes, we
can rebalance the lists, say every 2000ms move some
entries to a neighbour list. As a result, one entry can be
estimated after 1970ms or 2030ms. But this is a complication
not possible with the RCU locking; we need a safe way
to move entries to a neighbour list, say if the work walks
row 0 we can rebalance between rows 32 and 33, which are
1 second away from row 0. And not all list primitives allow
that for _rcu.

- the next option is to insert entries into some current list;
when its count reaches, say, 128, move to the next list
for inserting. This option tries to provide an exact 2000ms
delay for the first estimation of a newly added entry.
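
	For this last option, the add path could be something like
(rough sketch; est_rows, est_add_row and est_add_count are invented
names just for illustration):

	static void ip_vs_est_add(struct netns_ipvs *ipvs,
				  struct ip_vs_estimator *est)
	{
		/* move to the next row once 128 entries were added
		 * to the current one
		 */
		if (ipvs->est_add_count >= 128) {
			ipvs->est_add_row = (ipvs->est_add_row + 1) % 64;
			ipvs->est_add_count = 0;
		}
		list_add_rcu(&est->list, &ipvs->est_rows[ipvs->est_add_row]);
		ipvs->est_add_count++;
	}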

	We can start with some implementation and see if
your tests are happy.

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Long delay on estimation_timer causes packet latency
  2020-12-03  8:48       ` Julian Anastasov
@ 2020-12-04  3:27         ` dust.li
  2020-12-04  5:42           ` Julian Anastasov
  0 siblings, 1 reply; 10+ messages in thread
From: dust.li @ 2020-12-04  3:27 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: yunhong-cgl jiang, horms, netdev, lvs-devel, Yunhong Jiang


On 12/3/20 4:48 PM, Julian Anastasov wrote:
> 	Hello,
>
> On Thu, 3 Dec 2020, dust.li wrote:
>
>> Hi Yunhong & Julian, any updates ?
>>
>>
>> We've encountered the same problem. With lots of ipvs
>>
>> services plus many CPUs, it's easy to reproduce this issue.
>>
>> I have a simple script to reproduce:
>>
>> First add many ipvs services:
>>
>> for((i=0;i<50000;i++)); do
>>          ipvsadm -A -t 10.10.10.10:$((2000+$i))
>> done
>>
>>
>> Then, check the latency of estimation_timer() using bpftrace:
>>
>> #!/usr/bin/bpftrace
>>
>> kprobe:estimation_timer {
>>          @enter = nsecs;
>> }
>>
>> kretprobe:estimation_timer {
>>          $exit = nsecs;
>>          printf("latency: %ld us\n", (nsecs - @enter)/1000);
>> }
>>
>> I observed about 268ms delay on my 104 CPUs test server.
>>
>> Attaching 2 probes...
>> latency: 268807 us
>> latency: 268519 us
>> latency: 269263 us
>>
>>
>> And I tried moving estimation_timer() into a delayed
>>
>> workqueue, this do make things better. But since the
>>
>> estimation won't give up CPU, it can run for pretty
>>
>> long without scheduling on a server which don't have
>>
>> preempt enabled, so tasks on that CPU can't get executed
>>
>> during that period.
>>
>>
>> Since the estimation repeated every 2s, we can't call
>>
>> cond_resched() to give up CPU in the middle of iterating the
>>
>> est_list, or the estimation will be quite inaccurate.
>>
>> Besides the est_list needs to be protected.
>>
>>
>> I haven't found any ideal solution yet, currently, we just
>>
>> moved the estimation into kworker and add sysctl to allow
>>
>> us to disable the estimation, since we don't need the
>>
>> estimation anyway.
>>
>>
>> Our patches is pretty simple now, if you think it's useful,
>>
>> I can paste them
>>
>>
>> Do you guys have any suggestions or solutions ?
> 	When I saw the report first time, I thought on this
> issue and here is what I think:
>
> - single delayed work (slow to stop them if using many)
>
> - the delayed work walks say 64 lists with estimators and
> reschedules itself for the next 30ms, as result, 30ms*64=1920ms,
> 80ms will be idle up to the 2000ms period for estimation
> for every list. As result, every list is walked once per 2000ms.
> If 30ms is odd for all kind of HZ values, this can be
> 20ms * 100.
>
> - work will use spin_lock_bh(&s->lock) to protect the
> entries, we do not want delays between /proc readers and
> the work if using mutex. But _bh locks stop timers and
> networking for short time :( Not sure yet if just spin_lock
> is safe for both /proc and estimator's work.
>
> - while walking one of the 64 lists we should use just
> rcu read locking for the current list, no cond_resched_rcu
> because we do not know how to synchronize while entries are
> removed at the same time. For now using array with 64 lists
> solves this problem.
>
> - the algorith will look like this:
>
> 	int row = 0;
>
> 	for () {
> 		rcu_read_lock();
> 		list_for_each_entry_rcu(e, &ipvs->est_list[row], list) {
>
> 		...
>
> 			/* Should we stop immediately ? */
> 			if (!ipvs->enable || stopping delayed work) {
> 				rcu_read_unlock();
> 				goto out;
> 			}
> 		}
> 		/* rcu_read_unlock will reschedule if needed
> 		 * but we know where to start from the next time,
> 		 * i.e. from next slot
> 		 */
> 		rcu_read_unlock();
> 		reschedule delayed work for +30ms or +110ms if last row
> 		by using queue_delayed_work*()
> 		row ++;
> 		if (row >= 64)
> 			row = 0;
> 	}
>
> out:
>
> - the drawback is that without cond_resched_rcu we are not
> fair to other processes, solution is needed, we just reduce
> the problem by using 64 lists which can be imbalanced.
>
> - next goal: entries should be removed with list_del_rcu,
> without any locks, we do not want to delay configurations,
> ipvs->est_lock should be removed.
>
> - now the problem is how to spread the entries to 64 lists.
> One way is randomly, in this case first estimation may be
> for less than 2000ms. In this case, less entries means
> less work for all 64 steps. But what if entries are removed
> and some slots become empty and others too large?
>
> - if we want to make the work equal for all 64 passes, we
> can rebalance the lists, say on every 2000ms move some
> entries to neighbour list. As result, one entry can be
> estimated after 1970ms or 2030ms. But this is complication
> not possible with the RCU locking, we need a safe way
> to move entries to neighbour list, say if work walks
> row 0 we can rebalance between rows 32 and 33 which are
> 1 second away of row 0. And not all list primitives allow
> it for _rcu.
>
> - next options is to insert entries in some current list,
> if their count reaches, say 128, then move to the next list
> for inserting. This option tries to provide exact 2000ms
> delay for the first estimation for the newly added entry.
>
> 	We can start with some implementation and see if
> your tests are happy.

Thanks for sharing your thoughts!

I think it's a good idea to split the est_list into different slots;
I believe it will dramatically reduce the delay brought by the
estimation.

My only concern is the cost of the estimation when the number of
services is large. Splitting the est_list won't reduce the real work
to do.

In our case, each estimation costs at most 268ms out of every 2000ms,
which is about 13% of one CPU hyper-thread, and this should be a
common case in a large K8S cluster with lots of services.

Since the estimation is not needed in our environment at all, it's
just a waste of CPU resources. Have you ever considered adding a
switch to let the user turn the estimator off?

Thanks!

Dust



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Long delay on estimation_timer causes packet latency
  2020-12-04  3:27         ` dust.li
@ 2020-12-04  5:42           ` Julian Anastasov
  2020-12-04 13:59             ` dust.li
  0 siblings, 1 reply; 10+ messages in thread
From: Julian Anastasov @ 2020-12-04  5:42 UTC (permalink / raw)
  To: dust.li; +Cc: yunhong-cgl jiang, horms, netdev, lvs-devel, Yunhong Jiang


	Hello,

On Fri, 4 Dec 2020, dust.li wrote:

> 
> On 12/3/20 4:48 PM, Julian Anastasov wrote:
> >
> > - work will use spin_lock_bh(&s->lock) to protect the
> > entries, we do not want delays between /proc readers and
> > the work if using mutex. But _bh locks stop timers and
> > networking for short time :( Not sure yet if just spin_lock
> > is safe for both /proc and estimator's work.

	Here, stopping BH may not be so fatal if some
CPUs are used for networking and others for workqueues.

> Thanks for sharing your thoughts !
> 
> 
> I think it's a good idea to split the est_list into different
> 
> slots, I believe it will dramatically reduce the delay brought
> 
> by estimation.

	268ms/64 => ~4ms on average. As the estimation with a single
work item does not utilize many CPUs simultaneously, this can be a
problem for 300,000-400,000 services, but that looks crazy.

> My only concern is the cost of the estimation when the number of
> 
> services is large. Splitting the est_list won't reduce the real
> 
> work to do.
> 
> In our case, each estimation cost at most 268ms/2000ms, which is
> 
> about 13% of one CPU hyper-thread, and this should be a common case
> 
> in a large K8S cluster with lots of services.
> 
> Since the estimation is not needed in our environment at all, it's
> 
> just a waste of CPU resource. Have you ever consider add a switch to
> 
> let the user turn the estimator off ?

	No problem to add a sysctl var for this; we usually add a check
function which can be used in ip_vs_in_stats, ip_vs_out_stats and
ip_vs_conn_stats. If the switch can be changed at any time, what should
we do? Add/del est entries as normal, but do not start the delayed
work if the flag disables stats. When the flag is enabled, counters
will increase and we will start the delayed work.
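
	Something like this, for example (names are illustrative only;
sysctl_run_estimation and est_work are assumed fields):

	static inline bool ip_vs_est_enabled(struct netns_ipvs *ipvs)
	{
		return READ_ONCE(ipvs->sysctl_run_estimation);
	}

	static void ip_vs_est_reschedule(struct netns_ipvs *ipvs)
	{
		/* do not arm the delayed work while stats are disabled;
		 * when the flag is enabled again we simply re-arm it
		 */
		if (ip_vs_est_enabled(ipvs))
			queue_delayed_work(system_long_wq, &ipvs->est_work,
					   2 * HZ);
	}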

Regards

--
Julian Anastasov <ja@ssi.bg>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Long delay on estimation_timer causes packet latency
  2020-12-04  5:42           ` Julian Anastasov
@ 2020-12-04 13:59             ` dust.li
  0 siblings, 0 replies; 10+ messages in thread
From: dust.li @ 2020-12-04 13:59 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: yunhong-cgl jiang, horms, netdev, lvs-devel, Yunhong Jiang

On Fri, Dec 04, 2020 at 07:42:56AM +0200, Julian Anastasov wrote:
>
>	Hello,
>
>On Fri, 4 Dec 2020, dust.li wrote:
>
>> 
>> On 12/3/20 4:48 PM, Julian Anastasov wrote:
>> >
>> > - work will use spin_lock_bh(&s->lock) to protect the
>> > entries, we do not want delays between /proc readers and
>> > the work if using mutex. But _bh locks stop timers and
>> > networking for short time :( Not sure yet if just spin_lock
>> > is safe for both /proc and estimator's work.
>
>	Here stopping BH is may be not so fatal if some
>CPUs are used for networking and others for workqueues.
>
>> Thanks for sharing your thoughts !
>> 
>> 
>> I think it's a good idea to split the est_list into different
>> 
>> slots, I believe it will dramatically reduce the delay brought
>> 
>> by estimation.
>
>	268ms/64 => 4ms average. As the estimation with single
>work does not utilize many CPUs simultaneously, this can be a
>problem for 300000-400000 services but this looks crazy.
Yes. Consider the largest server we use now, which has 256 hyper-threads
across 4 NUMA nodes. Even that should not be a big problem.

>
>> My only concern is the cost of the estimation when the number of
>> 
>> services is large. Splitting the est_list won't reduce the real
>> 
>> work to do.
>> 
>> In our case, each estimation cost at most 268ms/2000ms, which is
>> 
>> about 13% of one CPU hyper-thread, and this should be a common case
>> 
>> in a large K8S cluster with lots of services.
>> 
>> Since the estimation is not needed in our environment at all, it's
>> 
>> just a waste of CPU resource. Have you ever consider add a switch to
>> 
>> let the user turn the estimator off ?
>
>	No problem to add sysctl var for this, we usually add function
>to check which can be used in ip_vs_in_stats, ip_vs_out_stats,
>ip_vs_conn_stats. If switch can be changed at any time, what should
>we do? Add/Del est entries as normally but do not start the
>delayed work if flag disables stats. When flag is enabled counters
>will increase and we will start delayed work.

Yes, this would be perfect for me!

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2020-12-04 14:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-16 22:00 Long delay on estimation_timer causes packet latency yunhong-cgl jiang
2020-04-16 22:00 ` yunhong-cgl jiang
2020-04-17  7:47 ` Julian Anastasov
2020-04-17 16:56   ` yunhong-cgl jiang
2020-04-17 16:56     ` yunhong-cgl jiang
2020-12-03  6:42     ` dust.li
2020-12-03  8:48       ` Julian Anastasov
2020-12-04  3:27         ` dust.li
2020-12-04  5:42           ` Julian Anastasov
2020-12-04 13:59             ` dust.li
