All of lore.kernel.org
 help / color / mirror / Atom feed
* sys_epoll_wait high CPU load in 2.6.37
@ 2011-01-26  0:09 Simon Kirby
  2011-01-26  7:18 ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Simon Kirby @ 2011-01-26  0:09 UTC (permalink / raw)
  To: linux-kernel, Shawn Bohrer, Eric Dumazet

Hello!

Since upgrading 2.6.36 -> 2.6.37, dovecot's "anvil" process seems to end
up taking a lot more time in "top", and "perf top" shows output like this
(system-wide):

  samples  pcnt function                      DSO
  _______ _____ _____________________________ __________________________

  2405.00 68.8% sys_epoll_wait                [kernel.kallsyms]
    33.00  0.9% mail_cache_lookup_iter_next   libdovecot-storage.so.0.0.0
    30.00  0.9% _raw_spin_lock                [kernel.kallsyms]
    ...etc...

It only wakes up 5-10 times per second or so (on this box), and does
stuff like this:

epoll_wait(12, {{EPOLLIN, {u32=19417616, u64=19417616}}}, 25, 2147483647) = 1
read(29, "PENALTY-GET\t192.168.31.10\n"..., 738) = 26
write(29, "0 0\n"..., 4)                = 4
epoll_wait(12, {{EPOLLIN, {u32=19395632, u64=19395632}}}, 25, 2147483647) = 1
read(18, "LOOKUP\tpop3/192.168.31.10/tshield"..., 668) = 58
write(18, "0\n"..., 2)                  = 2
epoll_wait(12, {{EPOLLIN, {u32=19373072, u64=19373072}}}, 25, 2147483647) = 1
read(7, "CONNECT\t3490\tpop3/192.168.31.10/t"..., 254) = 64
epoll_wait(12, {{EPOLLIN, {u32=19373072, u64=19373072}}}, 25, 2147483647) = 1
read(7, "DISCONNECT\t3482\tpop3/192.168.31.1"..., 190) = 62

Anything obvious here?  anvil talks over UNIX sockets to the rest of
dovecot, and uses epoll_wait.  So, suspect commits might be:

95aac7b1cd224f568fb83937044cd303ff11b029
5456f09aaf88731e16dbcea7522cb330b6846415
or other bits from
 git log v2.6.36..v2.6.37 net/unix/af_unix.c fs/eventpoll.c

I suspect it has something to do with that "infinite value" check removal
in that first commit.  It doesn't show up easily on a test box, but I can
try reverting 95aac7b1cd in production if it's not obvious.

Simon-

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26  0:09 sys_epoll_wait high CPU load in 2.6.37 Simon Kirby
@ 2011-01-26  7:18 ` Eric Dumazet
  2011-01-26 11:16   ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-01-26  7:18 UTC (permalink / raw)
  To: Simon Kirby; +Cc: linux-kernel, Shawn Bohrer, Davide Libenzi

Le mardi 25 janvier 2011 à 16:09 -0800, Simon Kirby a écrit :
> Hello!
> 
> Since upgrading 2.6.36 -> 2.6.37, dovecot's "anvil" process seems to end
> up taking a lot more time in "top", and "perf top" shows output like this
> (system-wide):
> 
>   samples  pcnt function                      DSO
>   _______ _____ _____________________________ __________________________
> 
>   2405.00 68.8% sys_epoll_wait                [kernel.kallsyms]
>     33.00  0.9% mail_cache_lookup_iter_next   libdovecot-storage.so.0.0.0
>     30.00  0.9% _raw_spin_lock                [kernel.kallsyms]
>     ...etc...
> 
> It only wakes up 5-10 times per second or so (on this box), and does
> stuff like this:
> 
> epoll_wait(12, {{EPOLLIN, {u32=19417616, u64=19417616}}}, 25, 2147483647) = 1
> read(29, "PENALTY-GET\t192.168.31.10\n"..., 738) = 26
> write(29, "0 0\n"..., 4)                = 4
> epoll_wait(12, {{EPOLLIN, {u32=19395632, u64=19395632}}}, 25, 2147483647) = 1
> read(18, "LOOKUP\tpop3/192.168.31.10/tshield"..., 668) = 58
> write(18, "0\n"..., 2)                  = 2
> epoll_wait(12, {{EPOLLIN, {u32=19373072, u64=19373072}}}, 25, 2147483647) = 1
> read(7, "CONNECT\t3490\tpop3/192.168.31.10/t"..., 254) = 64
> epoll_wait(12, {{EPOLLIN, {u32=19373072, u64=19373072}}}, 25, 2147483647) = 1
> read(7, "DISCONNECT\t3482\tpop3/192.168.31.1"..., 190) = 62
> 
> Anything obvious here?  anvil talks over UNIX sockets to the rest of
> dovecot, and uses epoll_wait.  So, suspect commits might be:
> 
> 95aac7b1cd224f568fb83937044cd303ff11b029
> 5456f09aaf88731e16dbcea7522cb330b6846415
> or other bits from
>  git log v2.6.36..v2.6.37 net/unix/af_unix.c fs/eventpoll.c
> 
> I suspect it has something to do with that "infinite value" check removal
> in that first commit.  It doesn't show up easily on a test box, but I can
> try reverting 95aac7b1cd in production if it's not obvious.
> 
> Simon-

Yes, 95aac7b1cd is the problem, but anvil should use a 0 (no) timeout
instead of 2147483647 ms : epoll_wait() doesnt have to arm a timer in
this case, it is a bit faster.




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26  7:18 ` Eric Dumazet
@ 2011-01-26 11:16   ` Eric Dumazet
  2011-01-26 15:31     ` Davide Libenzi
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-01-26 11:16 UTC (permalink / raw)
  To: Simon Kirby; +Cc: linux-kernel, Shawn Bohrer, Davide Libenzi, Andrew Morton

Le mercredi 26 janvier 2011 à 08:18 +0100, Eric Dumazet a écrit :
> Le mardi 25 janvier 2011 à 16:09 -0800, Simon Kirby a écrit :
> > Hello!
> > 
> > Since upgrading 2.6.36 -> 2.6.37, dovecot's "anvil" process seems to end
> > up taking a lot more time in "top", and "perf top" shows output like this
> > (system-wide):
> > 
> >   samples  pcnt function                      DSO
> >   _______ _____ _____________________________ __________________________
> > 
> >   2405.00 68.8% sys_epoll_wait                [kernel.kallsyms]
> >     33.00  0.9% mail_cache_lookup_iter_next   libdovecot-storage.so.0.0.0
> >     30.00  0.9% _raw_spin_lock                [kernel.kallsyms]
> >     ...etc...
> > 
> > It only wakes up 5-10 times per second or so (on this box), and does
> > stuff like this:
> > 
> > epoll_wait(12, {{EPOLLIN, {u32=19417616, u64=19417616}}}, 25, 2147483647) = 1
> > read(29, "PENALTY-GET\t192.168.31.10\n"..., 738) = 26
> > write(29, "0 0\n"..., 4)                = 4
> > epoll_wait(12, {{EPOLLIN, {u32=19395632, u64=19395632}}}, 25, 2147483647) = 1
> > read(18, "LOOKUP\tpop3/192.168.31.10/tshield"..., 668) = 58
> > write(18, "0\n"..., 2)                  = 2
> > epoll_wait(12, {{EPOLLIN, {u32=19373072, u64=19373072}}}, 25, 2147483647) = 1
> > read(7, "CONNECT\t3490\tpop3/192.168.31.10/t"..., 254) = 64
> > epoll_wait(12, {{EPOLLIN, {u32=19373072, u64=19373072}}}, 25, 2147483647) = 1
> > read(7, "DISCONNECT\t3482\tpop3/192.168.31.1"..., 190) = 62
> > 
> > Anything obvious here?  anvil talks over UNIX sockets to the rest of
> > dovecot, and uses epoll_wait.  So, suspect commits might be:
> > 
> > 95aac7b1cd224f568fb83937044cd303ff11b029
> > 5456f09aaf88731e16dbcea7522cb330b6846415
> > or other bits from
> >  git log v2.6.36..v2.6.37 net/unix/af_unix.c fs/eventpoll.c
> > 
> > I suspect it has something to do with that "infinite value" check removal
> > in that first commit.  It doesn't show up easily on a test box, but I can
> > try reverting 95aac7b1cd in production if it's not obvious.
> > 
> > Simon-
> 
> Yes, 95aac7b1cd is the problem, but anvil should use a 0 (no) timeout
> instead of 2147483647 ms : epoll_wait() doesnt have to arm a timer in
> this case, it is a bit faster.
> 
> 

Slowness comes from timespec_add_ns() : This one assumed small 'ns'
argument, since it wants to avoid a divide instruction.

static __always_inline void timespec_add_ns(struct timespec *a, u64 ns)
{
        a->tv_sec += __iter_div_u64_rem(a->tv_nsec + ns, NSEC_PER_SEC, &ns);
        a->tv_nsec = ns;
}

We should do this differently for epoll usage ;)

Please try following patch :

[PATCH] epoll: epoll_wait() should be careful in timespec_add_ns use

commit 95aac7b1cd224f (epoll: make epoll_wait() use the hrtimer range
feature) added a performance regression because it used
timespec_add_ns() with potential very large 'ns' values.

Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Shawn Bohrer <shawn.bohrer@gmail.com>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 fs/eventpoll.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index cc8a9b7..7ec0890 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1126,7 +1126,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 
 	if (timeout > 0) {
 		ktime_get_ts(&end_time);
-		timespec_add_ns(&end_time, (u64)timeout * NSEC_PER_MSEC);
+		end_time.tv_sec += timeout / MSEC_PER_SEC;
+		timeout %= MSEC_PER_SEC;
+		timespec_add_ns(&end_time, timeout * NSEC_PER_MSEC);
 		slack = select_estimate_accuracy(&end_time);
 		to = &expires;
 		*to = timespec_to_ktime(end_time);



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26 11:16   ` Eric Dumazet
@ 2011-01-26 15:31     ` Davide Libenzi
  2011-01-26 15:43       ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Davide Libenzi @ 2011-01-26 15:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Simon Kirby, Linux Kernel Mailing List, Shawn Bohrer, Andrew Morton

[-- Attachment #1: Type: TEXT/PLAIN, Size: 4048 bytes --]

On Wed, 26 Jan 2011, Eric Dumazet wrote:

> Le mercredi 26 janvier 2011 à 08:18 +0100, Eric Dumazet a écrit :
> > Le mardi 25 janvier 2011 à 16:09 -0800, Simon Kirby a écrit :
> > > Hello!
> > > 
> > > Since upgrading 2.6.36 -> 2.6.37, dovecot's "anvil" process seems to end
> > > up taking a lot more time in "top", and "perf top" shows output like this
> > > (system-wide):
> > > 
> > >   samples  pcnt function                      DSO
> > >   _______ _____ _____________________________ __________________________
> > > 
> > >   2405.00 68.8% sys_epoll_wait                [kernel.kallsyms]
> > >     33.00  0.9% mail_cache_lookup_iter_next   libdovecot-storage.so.0.0.0
> > >     30.00  0.9% _raw_spin_lock                [kernel.kallsyms]
> > >     ...etc...
> > > 
> > > It only wakes up 5-10 times per second or so (on this box), and does
> > > stuff like this:
> > > 
> > > epoll_wait(12, {{EPOLLIN, {u32=19417616, u64=19417616}}}, 25, 2147483647) = 1
> > > read(29, "PENALTY-GET\t192.168.31.10\n"..., 738) = 26
> > > write(29, "0 0\n"..., 4)                = 4
> > > epoll_wait(12, {{EPOLLIN, {u32=19395632, u64=19395632}}}, 25, 2147483647) = 1
> > > read(18, "LOOKUP\tpop3/192.168.31.10/tshield"..., 668) = 58
> > > write(18, "0\n"..., 2)                  = 2
> > > epoll_wait(12, {{EPOLLIN, {u32=19373072, u64=19373072}}}, 25, 2147483647) = 1
> > > read(7, "CONNECT\t3490\tpop3/192.168.31.10/t"..., 254) = 64
> > > epoll_wait(12, {{EPOLLIN, {u32=19373072, u64=19373072}}}, 25, 2147483647) = 1
> > > read(7, "DISCONNECT\t3482\tpop3/192.168.31.1"..., 190) = 62
> > > 
> > > Anything obvious here?  anvil talks over UNIX sockets to the rest of
> > > dovecot, and uses epoll_wait.  So, suspect commits might be:
> > > 
> > > 95aac7b1cd224f568fb83937044cd303ff11b029
> > > 5456f09aaf88731e16dbcea7522cb330b6846415
> > > or other bits from
> > >  git log v2.6.36..v2.6.37 net/unix/af_unix.c fs/eventpoll.c
> > > 
> > > I suspect it has something to do with that "infinite value" check removal
> > > in that first commit.  It doesn't show up easily on a test box, but I can
> > > try reverting 95aac7b1cd in production if it's not obvious.
> > > 
> > > Simon-
> > 
> > Yes, 95aac7b1cd is the problem, but anvil should use a 0 (no) timeout
> > instead of 2147483647 ms : epoll_wait() doesnt have to arm a timer in
> > this case, it is a bit faster.
> > 
> > 
> 
> Slowness comes from timespec_add_ns() : This one assumed small 'ns'
> argument, since it wants to avoid a divide instruction.
> 
> static __always_inline void timespec_add_ns(struct timespec *a, u64 ns)
> {
>         a->tv_sec += __iter_div_u64_rem(a->tv_nsec + ns, NSEC_PER_SEC, &ns);
>         a->tv_nsec = ns;
> }
> 
> We should do this differently for epoll usage ;)
> 
> Please try following patch :
> 
> [PATCH] epoll: epoll_wait() should be careful in timespec_add_ns use
> 
> commit 95aac7b1cd224f (epoll: make epoll_wait() use the hrtimer range
> feature) added a performance regression because it used
> timespec_add_ns() with potential very large 'ns' values.
> 
> Reported-by: Simon Kirby <sim@hostway.ca>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Shawn Bohrer <shawn.bohrer@gmail.com>
> CC: Davide Libenzi <davidel@xmailserver.org>
> CC: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/eventpoll.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index cc8a9b7..7ec0890 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1126,7 +1126,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  
>  	if (timeout > 0) {
>  		ktime_get_ts(&end_time);
> -		timespec_add_ns(&end_time, (u64)timeout * NSEC_PER_MSEC);
> +		end_time.tv_sec += timeout / MSEC_PER_SEC;
> +		timeout %= MSEC_PER_SEC;
> +		timespec_add_ns(&end_time, timeout * NSEC_PER_MSEC);
>  		slack = select_estimate_accuracy(&end_time);
>  		to = &expires;
>  		*to = timespec_to_ktime(end_time);

Yep, we can overflow the timeout, with the calculation above.
A timespec_add_ms()?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26 15:31     ` Davide Libenzi
@ 2011-01-26 15:43       ` Eric Dumazet
  2011-01-26 15:52         ` Davide Libenzi
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-01-26 15:43 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Simon Kirby, Linux Kernel Mailing List, Shawn Bohrer, Andrew Morton

Le mercredi 26 janvier 2011 à 07:31 -0800, Davide Libenzi a écrit :
> On Wed, 26 Jan 2011, Eric Dumazet wrote:
> 

> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index cc8a9b7..7ec0890 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -1126,7 +1126,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >  
> >  	if (timeout > 0) {
> >  		ktime_get_ts(&end_time);
> > -		timespec_add_ns(&end_time, (u64)timeout * NSEC_PER_MSEC);
> > +		end_time.tv_sec += timeout / MSEC_PER_SEC;
> > +		timeout %= MSEC_PER_SEC;
> > +		timespec_add_ns(&end_time, timeout * NSEC_PER_MSEC);
> >  		slack = select_estimate_accuracy(&end_time);
> >  		to = &expires;
> >  		*to = timespec_to_ktime(end_time);
> 
> Yep, we can overflow the timeout, with the calculation above.
> A timespec_add_ms()?

Well, given timeout after modulo contains a number between 0 and 999,
multiply by 1.000.000 (NSEC_PER_MSEC) cant overflow.




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26 15:43       ` Eric Dumazet
@ 2011-01-26 15:52         ` Davide Libenzi
  2011-01-26 15:59           ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Davide Libenzi @ 2011-01-26 15:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Simon Kirby, Linux Kernel Mailing List, Shawn Bohrer, Andrew Morton

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1285 bytes --]

On Wed, 26 Jan 2011, Eric Dumazet wrote:

> Le mercredi 26 janvier 2011 à 07:31 -0800, Davide Libenzi a écrit :
> > On Wed, 26 Jan 2011, Eric Dumazet wrote:
> > 
> 
> > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > > index cc8a9b7..7ec0890 100644
> > > --- a/fs/eventpoll.c
> > > +++ b/fs/eventpoll.c
> > > @@ -1126,7 +1126,9 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> > >  
> > >  	if (timeout > 0) {
> > >  		ktime_get_ts(&end_time);
> > > -		timespec_add_ns(&end_time, (u64)timeout * NSEC_PER_MSEC);
> > > +		end_time.tv_sec += timeout / MSEC_PER_SEC;
> > > +		timeout %= MSEC_PER_SEC;
> > > +		timespec_add_ns(&end_time, timeout * NSEC_PER_MSEC);
> > >  		slack = select_estimate_accuracy(&end_time);
> > >  		to = &expires;
> > >  		*to = timespec_to_ktime(end_time);
> > 
> > Yep, we can overflow the timeout, with the calculation above.
> > A timespec_add_ms()?
> 
> Well, given timeout after modulo contains a number between 0 and 999,
> multiply by 1.000.000 (NSEC_PER_MSEC) cant overflow.

For "above", I meant the current epoll expire time calculation, which was 
described above in the message ;)
The hint for a timespec_add_ms() was because we must be doing something 
similar in poll, don't we (/me got no code in front ATM)?

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26 15:52         ` Davide Libenzi
@ 2011-01-26 15:59           ` Eric Dumazet
  2011-01-26 16:13             ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-01-26 15:59 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Simon Kirby, Linux Kernel Mailing List, Shawn Bohrer, Andrew Morton

Le mercredi 26 janvier 2011 à 07:52 -0800, Davide Libenzi a écrit :

> For "above", I meant the current epoll expire time calculation, which was 
> described above in the message ;)

Well, problem was not an overflow, but doing a loop 2.000.000 times ;)

> The hint for a timespec_add_ms() was because we must be doing something 
> similar in poll, don't we (/me got no code in front ATM)?

Apparently its done differently in poll(), using
poll_select_set_timeout() helper.


Give me some minutes I'll try to cook an alternate patch



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26 15:59           ` Eric Dumazet
@ 2011-01-26 16:13             ` Eric Dumazet
  2011-01-26 17:20               ` Shawn Bohrer
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-01-26 16:13 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Simon Kirby, Linux Kernel Mailing List, Shawn Bohrer,
	Andrew Morton, Thomas Gleixner

Le mercredi 26 janvier 2011 à 16:59 +0100, Eric Dumazet a écrit :
> Le mercredi 26 janvier 2011 à 07:52 -0800, Davide Libenzi a écrit :
> 
> > For "above", I meant the current epoll expire time calculation, which was 
> > described above in the message ;)
> 
> Well, problem was not an overflow, but doing a loop 2.000.000 times ;)
> 
> > The hint for a timespec_add_ms() was because we must be doing something 
> > similar in poll, don't we (/me got no code in front ATM)?
> 
> Apparently its done differently in poll(), using
> poll_select_set_timeout() helper.
> 
> 
> Give me some minutes I'll try to cook an alternate patch
> 

Here is the alternate patch, using poll_select_set_timeout() helper

Thanks

[PATCH v2] epoll: epoll_wait() should not use timespec_add_ns()

commit 95aac7b1cd224f (epoll: make epoll_wait() use the hrtimer range
feature) added a performance regression because it uses
timespec_add_ns() with potential very large 'ns' values.

Use poll_select_set_timeout() helper like poll()/select()

Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Shawn Bohrer <shawn.bohrer@gmail.com>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 fs/eventpoll.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index cc8a9b7..94d887b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1125,8 +1125,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	ktime_t expires, *to = NULL;
 
 	if (timeout > 0) {
-		ktime_get_ts(&end_time);
-		timespec_add_ns(&end_time, (u64)timeout * NSEC_PER_MSEC);
+		poll_select_set_timeout(&end_time, timeout / MSEC_PER_SEC,
+				NSEC_PER_MSEC * (timeout % MSEC_PER_SEC));
 		slack = select_estimate_accuracy(&end_time);
 		to = &expires;
 		*to = timespec_to_ktime(end_time);



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26 16:13             ` Eric Dumazet
@ 2011-01-26 17:20               ` Shawn Bohrer
  2011-01-26 17:51                 ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Shawn Bohrer @ 2011-01-26 17:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Davide Libenzi, Simon Kirby, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner

On Wed, Jan 26, 2011 at 05:13:29PM +0100, Eric Dumazet wrote:
> Le mercredi 26 janvier 2011 à 16:59 +0100, Eric Dumazet a écrit :
> > Le mercredi 26 janvier 2011 à 07:52 -0800, Davide Libenzi a écrit :
> > 
> > > For "above", I meant the current epoll expire time calculation, which was 
> > > described above in the message ;)
> > 
> > Well, problem was not an overflow, but doing a loop 2.000.000 times ;)
> > 
> > > The hint for a timespec_add_ms() was because we must be doing something 
> > > similar in poll, don't we (/me got no code in front ATM)?
> > 
> > Apparently its done differently in poll(), using
> > poll_select_set_timeout() helper.
> > 
> > 
> > Give me some minutes I'll try to cook an alternate patch
> > 
> 
> Here is the alternate patch, using poll_select_set_timeout() helper
> 
> Thanks
> 
> [PATCH v2] epoll: epoll_wait() should not use timespec_add_ns()
> 
> commit 95aac7b1cd224f (epoll: make epoll_wait() use the hrtimer range
> feature) added a performance regression because it uses
> timespec_add_ns() with potential very large 'ns' values.
> 
> Use poll_select_set_timeout() helper like poll()/select()
> 
> Reported-by: Simon Kirby <sim@hostway.ca>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Shawn Bohrer <shawn.bohrer@gmail.com>
> CC: Davide Libenzi <davidel@xmailserver.org>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/eventpoll.c |    4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index cc8a9b7..94d887b 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1125,8 +1125,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  	ktime_t expires, *to = NULL;
>  
>  	if (timeout > 0) {
> -		ktime_get_ts(&end_time);
> -		timespec_add_ns(&end_time, (u64)timeout * NSEC_PER_MSEC);
> +		poll_select_set_timeout(&end_time, timeout / MSEC_PER_SEC,
> +				NSEC_PER_MSEC * (timeout % MSEC_PER_SEC));
>  		slack = select_estimate_accuracy(&end_time);
>  		to = &expires;
>  		*to = timespec_to_ktime(end_time);

poll_select_set_timeout() jumps through some extra hoops that
aren't necessary in the epoll case so I actually like your previous
patch better.

--
Shawn

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26 17:20               ` Shawn Bohrer
@ 2011-01-26 17:51                 ` Eric Dumazet
  2011-01-26 18:06                   ` Davide Libenzi
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-01-26 17:51 UTC (permalink / raw)
  To: Shawn Bohrer
  Cc: Davide Libenzi, Simon Kirby, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner

Le mercredi 26 janvier 2011 à 11:20 -0600, Shawn Bohrer a écrit :
> On Wed, Jan 26, 2011 at 05:13:29PM +0100, Eric Dumazet wrote:
> > Le mercredi 26 janvier 2011 à 16:59 +0100, Eric Dumazet a écrit :
> > > Le mercredi 26 janvier 2011 à 07:52 -0800, Davide Libenzi a écrit :
> > > 
> > > > For "above", I meant the current epoll expire time calculation, which was 
> > > > described above in the message ;)
> > > 
> > > Well, problem was not an overflow, but doing a loop 2.000.000 times ;)
> > > 
> > > > The hint for a timespec_add_ms() was because we must be doing something 
> > > > similar in poll, don't we (/me got no code in front ATM)?
> > > 
> > > Apparently its done differently in poll(), using
> > > poll_select_set_timeout() helper.
> > > 
> > > 
> > > Give me some minutes I'll try to cook an alternate patch
> > > 
> > 
> > Here is the alternate patch, using poll_select_set_timeout() helper
> > 
> > Thanks
> > 
> > [PATCH v2] epoll: epoll_wait() should not use timespec_add_ns()
> > 
> > commit 95aac7b1cd224f (epoll: make epoll_wait() use the hrtimer range
> > feature) added a performance regression because it uses
> > timespec_add_ns() with potential very large 'ns' values.
> > 
> > Use poll_select_set_timeout() helper like poll()/select()
> > 
> > Reported-by: Simon Kirby <sim@hostway.ca>
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > CC: Shawn Bohrer <shawn.bohrer@gmail.com>
> > CC: Davide Libenzi <davidel@xmailserver.org>
> > CC: Thomas Gleixner <tglx@linutronix.de>
> > CC: Andrew Morton <akpm@linux-foundation.org>
> > ---
> >  fs/eventpoll.c |    4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index cc8a9b7..94d887b 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -1125,8 +1125,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >  	ktime_t expires, *to = NULL;
> >  
> >  	if (timeout > 0) {
> > -		ktime_get_ts(&end_time);
> > -		timespec_add_ns(&end_time, (u64)timeout * NSEC_PER_MSEC);
> > +		poll_select_set_timeout(&end_time, timeout / MSEC_PER_SEC,
> > +				NSEC_PER_MSEC * (timeout % MSEC_PER_SEC));
> >  		slack = select_estimate_accuracy(&end_time);
> >  		to = &expires;
> >  		*to = timespec_to_ktime(end_time);
> 
> poll_select_set_timeout() jumps through some extra hoops that
> aren't necessary in the epoll case so I actually like your previous
> patch better.

Well, I dont care, I let Davide decide, he is the boss ;)

This is a stable candidate, so adding timespec_add_ms() sounds overkill.




^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26 17:51                 ` Eric Dumazet
@ 2011-01-26 18:06                   ` Davide Libenzi
  2011-01-26 18:18                     ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Davide Libenzi @ 2011-01-26 18:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Shawn Bohrer, Simon Kirby, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner

On Wed, 26 Jan 2011, Eric Dumazet wrote:

> > > --- a/fs/eventpoll.c
> > > +++ b/fs/eventpoll.c
> > > @@ -1125,8 +1125,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> > >  	ktime_t expires, *to = NULL;
> > >  
> > >  	if (timeout > 0) {
> > > -		ktime_get_ts(&end_time);
> > > -		timespec_add_ns(&end_time, (u64)timeout * NSEC_PER_MSEC);
> > > +		poll_select_set_timeout(&end_time, timeout / MSEC_PER_SEC,
> > > +				NSEC_PER_MSEC * (timeout % MSEC_PER_SEC));
> > >  		slack = select_estimate_accuracy(&end_time);
> > >  		to = &expires;
> > >  		*to = timespec_to_ktime(end_time);
> > 
> > poll_select_set_timeout() jumps through some extra hoops that
> > aren't necessary in the epoll case so I actually like your previous
> > patch better.
> 
> Well, I dont care, I let Davide decide, he is the boss ;)
> 
> This is a stable candidate, so adding timespec_add_ms() sounds overkill.

Eric, if you look at fs/select.c (~line 925), poll does exactly the same 
thing as epoll do.
It too, ignores the eventual return value of poll_select_set_timeout(), so 
maybe a little bit more optimized ktime_get_ts+timespec_add_ms could make 
sense.


- Davide



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26 18:06                   ` Davide Libenzi
@ 2011-01-26 18:18                     ` Eric Dumazet
  2011-01-26 18:45                       ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-01-26 18:18 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Shawn Bohrer, Simon Kirby, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner

Le mercredi 26 janvier 2011 à 10:06 -0800, Davide Libenzi a écrit :

> Eric, if you look at fs/select.c (~line 925), poll does exactly the same 
> thing as epoll do.
> It too, ignores the eventual return value of poll_select_set_timeout(), so 
> maybe a little bit more optimized ktime_get_ts+timespec_add_ms could make 
> sense.
> 

OK, I'll post a V3 ;)

Thanks



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26 18:18                     ` Eric Dumazet
@ 2011-01-26 18:45                       ` Eric Dumazet
  2011-01-28  0:17                         ` Davide Libenzi
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2011-01-26 18:45 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Shawn Bohrer, Simon Kirby, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner

Le mercredi 26 janvier 2011 à 19:18 +0100, Eric Dumazet a écrit :
> Le mercredi 26 janvier 2011 à 10:06 -0800, Davide Libenzi a écrit :
> 
> > Eric, if you look at fs/select.c (~line 925), poll does exactly the same 
> > thing as epoll do.
> > It too, ignores the eventual return value of poll_select_set_timeout(), so 
> > maybe a little bit more optimized ktime_get_ts+timespec_add_ms could make 
> > sense.
> > 
> 
> OK, I'll post a V3 ;)

Well in the poll() case we handle a zero timeout, not in epoll().

So the helper function cannot be shared and can be static to epoll.


[PATCH v3] epoll: epoll_wait() should not use timespec_add_ns()

commit 95aac7b1cd224f (epoll: make epoll_wait() use the hrtimer range
feature) added a performance regression because it uses
timespec_add_ns() with potential very large 'ns' values.

Reported-by: Simon Kirby <sim@hostway.ca>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Shawn Bohrer <shawn.bohrer@gmail.com>
CC: Davide Libenzi <davidel@xmailserver.org>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 fs/eventpoll.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index cc8a9b7..d517aa3 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1114,6 +1114,17 @@ static int ep_send_events(struct eventpoll *ep,
 	return ep_scan_ready_list(ep, ep_send_events_proc, &esed);
 }
 
+static inline struct timespec epoll_set_mstimeout(long ms)
+{
+	struct timespec now, ts = {
+		.tv_sec = ms / MSEC_PER_SEC,
+		.tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
+	};
+
+	ktime_get_ts(&now);
+	return timespec_add_safe(now, ts);
+}
+
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		   int maxevents, long timeout)
 {
@@ -1121,12 +1132,11 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	unsigned long flags;
 	long slack;
 	wait_queue_t wait;
-	struct timespec end_time;
 	ktime_t expires, *to = NULL;
 
 	if (timeout > 0) {
-		ktime_get_ts(&end_time);
-		timespec_add_ns(&end_time, (u64)timeout * NSEC_PER_MSEC);
+		struct timespec end_time = epoll_set_mstimeout(timeout);
+
 		slack = select_estimate_accuracy(&end_time);
 		to = &expires;
 		*to = timespec_to_ktime(end_time);



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: sys_epoll_wait high CPU load in 2.6.37
  2011-01-26 18:45                       ` Eric Dumazet
@ 2011-01-28  0:17                         ` Davide Libenzi
  0 siblings, 0 replies; 14+ messages in thread
From: Davide Libenzi @ 2011-01-28  0:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Shawn Bohrer, Simon Kirby, Linux Kernel Mailing List,
	Andrew Morton, Thomas Gleixner

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2658 bytes --]

On Wed, 26 Jan 2011, Eric Dumazet wrote:

> Le mercredi 26 janvier 2011 à 19:18 +0100, Eric Dumazet a écrit :
> > Le mercredi 26 janvier 2011 à 10:06 -0800, Davide Libenzi a écrit :
> > 
> > > Eric, if you look at fs/select.c (~line 925), poll does exactly the same 
> > > thing as epoll do.
> > > It too, ignores the eventual return value of poll_select_set_timeout(), so 
> > > maybe a little bit more optimized ktime_get_ts+timespec_add_ms could make 
> > > sense.
> > > 
> > 
> > OK, I'll post a V3 ;)
> 
> Well in the poll() case we handle a zero timeout, not in epoll().
> 
> So the helper function cannot be shared and can be static to epoll.
> 
> 
> [PATCH v3] epoll: epoll_wait() should not use timespec_add_ns()
> 
> commit 95aac7b1cd224f (epoll: make epoll_wait() use the hrtimer range
> feature) added a performance regression because it uses
> timespec_add_ns() with potential very large 'ns' values.
> 
> Reported-by: Simon Kirby <sim@hostway.ca>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Shawn Bohrer <shawn.bohrer@gmail.com>
> CC: Davide Libenzi <davidel@xmailserver.org>
> CC: Thomas Gleixner <tglx@linutronix.de>
> CC: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/eventpoll.c |   16 +++++++++++++---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index cc8a9b7..d517aa3 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -1114,6 +1114,17 @@ static int ep_send_events(struct eventpoll *ep,
>  	return ep_scan_ready_list(ep, ep_send_events_proc, &esed);
>  }
>  
> +static inline struct timespec epoll_set_mstimeout(long ms)
> +{
> +	struct timespec now, ts = {
> +		.tv_sec = ms / MSEC_PER_SEC,
> +		.tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
> +	};
> +
> +	ktime_get_ts(&now);
> +	return timespec_add_safe(now, ts);
> +}
> +
>  static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  		   int maxevents, long timeout)
>  {
> @@ -1121,12 +1132,11 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
>  	unsigned long flags;
>  	long slack;
>  	wait_queue_t wait;
> -	struct timespec end_time;
>  	ktime_t expires, *to = NULL;
>  
>  	if (timeout > 0) {
> -		ktime_get_ts(&end_time);
> -		timespec_add_ns(&end_time, (u64)timeout * NSEC_PER_MSEC);
> +		struct timespec end_time = epoll_set_mstimeout(timeout);
> +
>  		slack = select_estimate_accuracy(&end_time);
>  		to = &expires;
>  		*to = timespec_to_ktime(end_time);

Looks OK for me. The epoll_ prefix fights with the ep_ standard on the 
original file, but I will post a one-liner for it, given Andrew already 
got this.


- Davide


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2011-01-28  0:17 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-26  0:09 sys_epoll_wait high CPU load in 2.6.37 Simon Kirby
2011-01-26  7:18 ` Eric Dumazet
2011-01-26 11:16   ` Eric Dumazet
2011-01-26 15:31     ` Davide Libenzi
2011-01-26 15:43       ` Eric Dumazet
2011-01-26 15:52         ` Davide Libenzi
2011-01-26 15:59           ` Eric Dumazet
2011-01-26 16:13             ` Eric Dumazet
2011-01-26 17:20               ` Shawn Bohrer
2011-01-26 17:51                 ` Eric Dumazet
2011-01-26 18:06                   ` Davide Libenzi
2011-01-26 18:18                     ` Eric Dumazet
2011-01-26 18:45                       ` Eric Dumazet
2011-01-28  0:17                         ` Davide Libenzi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.