From: Hou Tao <houtao1@huawei.com>
To: Jason Baron <jbaron@akamai.com>, Davidlohr Bueso <dave@stgolabs.net>
Cc: <akpm@linux-foundation.org>, <linux-kernel@vger.kernel.org>,
	<linux-fsdevel@vger.kernel.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Salman Qazi <sqazi@google.com>
Subject: Re: [PATCH] epoll: avoid calling ep_call_nested() from ep_poll_safewake()
Date: Thu, 18 Jan 2018 19:00:38 +0800	[thread overview]
Message-ID: <e49485bd-a66e-5843-4b77-3801c3b70d54@huawei.com> (raw)
In-Reply-To: <05416148-bf19-7718-b6b4-7658b5aca2ee@akamai.com>

Hi Jason,

On 2017/10/18 22:03, Jason Baron wrote:
> 
> 
> On 10/17/2017 11:37 AM, Davidlohr Bueso wrote:
>> On Fri, 13 Oct 2017, Jason Baron wrote:
>>
>>> The ep_poll_safewake() function is used to wakeup potentially nested
>>> epoll
>>> file descriptors. The function uses ep_call_nested() to prevent entering
>>> the same wake up queue more than once, and to prevent excessively deep
>>> wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary
>>> since we are already preventing these conditions during EPOLL_CTL_ADD.
>>> This
>>> saves extra function calls, and avoids taking a global lock during the
>>> ep_call_nested() calls.
>>
>> This makes sense.
>>
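[A small userspace sketch (Python's select.epoll wrapper on Linux) of the EPOLL_CTL_ADD-time check referred to above: closing a cycle between two epoll fds is rejected with ELOOP at add time, which is why the wakeup path itself no longer needs to re-check. The fd names are illustrative only.]

```python
import errno
import select

ep_a = select.epoll()
ep_b = select.epoll()

# A watching B is allowed; the kernel records the nesting relationship.
ep_a.register(ep_b.fileno(), select.EPOLLIN)

# B watching A would close a cycle; the loop check performed at
# EPOLL_CTL_ADD time rejects it with ELOOP.
err = None
try:
    ep_b.register(ep_a.fileno(), select.EPOLLIN)
except OSError as e:
    err = e.errno

print(err == errno.ELOOP)  # True
```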
>>>
>>> I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
>>> case, since ep_call_nested() keeps track of the nesting level, and
>>> this is
>>> required by the call to spin_lock_irqsave_nested(). It would be nice to
>>> remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as
>>> well, however its not clear how to simply pass the nesting level through
>>> multiple wake_up() levels without more surgery. In any case, I don't
>>> think
>>> CONFIG_DEBUG_LOCK_ALLOC is generally used for production. This patch,
>>> also
>>> apparently fixes a workload at Google that Salman Qazi reported by
>>> completely removing the poll_safewake_ncalls->lock from wakeup paths.
>>
>> I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as
>> I was tackling the nested epoll scaling issue with loop and readywalk lists
>> in mind.
>>>
> 
> I'm not sure the details of the workload - perhaps Salman can elaborate
> further about it.
> 
> It would seem that the safewake would potentially be the most contended
> in general in the nested case, because generally you have a few epoll
> fds attached to lots of sources doing wakeups. So those sources are all
> going to conflict on the safewake lock. The readywalk is used when
> performing a 'nested' poll and in general this is likely going to be
> called on a few epoll fds. That said, we should remove it too. I will
> post a patch to remove it.
> 
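[To make the nested case concrete, here is a minimal userspace sketch (Python's select.epoll wrapper on Linux) of the wakeup path being discussed: an event source wakes an inner epoll fd, and that wakeup must propagate to an outer epoll fd watching it, which is the path ep_poll_safewake() serves in the kernel.]

```python
import os
import select

r, w = os.pipe()

inner = select.epoll()
inner.register(r, select.EPOLLIN)               # inner epoll watches a pipe

outer = select.epoll()
outer.register(inner.fileno(), select.EPOLLIN)  # nested: outer watches inner

os.write(w, b"x")  # a wakeup on the source must propagate inner -> outer

events = outer.poll(timeout=1)
print(len(events))  # 1: the inner epoll fd became readable
```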
> The loop lock is used during EPOLL_CTL_ADD to check for loops and deep
> wakeup paths and so I would expect this to be less common, but I
> wouldn't doubt there are workloads impacted by it. We can potentially, I
> think remove this one too - and the global 'epmutex'. I posted some
> ideas a while ago on it:
> 
> http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
> 
> We can work through these ideas or others...

I have tested these patches on a simple non-nested epoll workload and found a performance
regression (-0.5%, which is better than the -1.0% from my patch set) under the normal
nginx conf, and a performance improvement (+3%, versus +4% for my patch set) under the
one-fd-per-request conf. The reason for the performance improvement is clear; however,
I haven't figured out the reason for the performance regression (maybe it is cache-related?).

It seems that your patch set works fine under both normal and nested epoll workloads,
so I don't plan to continue my patch set, which only works for the normal epoll workload.
I have one question about your patch set: it concerns the newly added fields in
struct file. I had tried to move these fields into struct eventpoll, so that only epoll fds
would be added into a disjoint set and the target fd would be linked to a disjoint set
dynamically, but that method didn't work out because there are multiple target fds, and
the order in which these target fds are added to a disjoint set can lead to deadlock.
So do you have other ideas to reduce the size of these fields?
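[For reference, the disjoint-set bookkeeping being discussed can be sketched generically; this is a minimal union-find with path halving, not the kernel implementation, and the fd names are hypothetical.]

```python
# Minimal union-find (disjoint set) sketch: epoll fds that are connected by
# watch relationships end up in the same component, so a loop/depth check
# only needs to consider that component.
parent = {}

def find(x):
    """Return the component root of x, halving the path as we walk up."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving keeps trees shallow
        x = parent[x]
    return x

def union(a, b):
    """Merge the components of a and b."""
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra

union("ep1", "ep2")  # ep1 watches ep2
union("ep2", "ep3")  # ep2 watches ep3
print(find("ep3") == find("ep1"))  # True: one connected component
```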

Thanks,
Tao

The following lines are the performance results:

baseline: result for v4.14
ep_cc: result for v4.14 + ep_cc patch set
my: result for v4.14 + my patch set

The first 15 columns are the results from wrk, and their unit is Requests/sec;
the last two columns are their average value and standard deviation.
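[As a sanity check, the two summary columns can be recomputed from the 15 raw wrk numbers; the sketch below uses the baseline row of the normal scenario, copied from the table, and Python's statistics module (statistics.stdev() is the sample standard deviation, n - 1).]

```python
import statistics

# Baseline row of the normal (nginx + wrk) scenario, Requests/sec.
baseline = [
    376704.74, 377223.88, 378096.67, 379484.45, 379199.04,
    379526.21, 378008.82, 379959.98, 377634.47, 379127.27,
    377622.92, 379442.12, 378994.44, 376000.08, 377046.58,
]

print(f"{statistics.mean(baseline):.4f}")   # 378271.4447, as in the table
print(f"{statistics.stdev(baseline):.6f}")  # ≈ 1210.673592, as in the table
```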

* normal scenario: nginx + wrk

baseline
376704.74 377223.88 378096.67 379484.45 379199.04 379526.21 378008.82 \
379959.98 377634.47 379127.27 377622.92 379442.12 378994.44 376000.08 377046.58 \
378271.4447 1210.673592

ep_cc
373376.08 376897.65 373493.65 376154.44 377374.36 379080.88 375124.44 \
376404.16 376539.28 375141.84 379075.32 377214.39 374524.52 375605.87 372167.8 \
375878.312 1983.190539

my
373067.96 372369 375317.61 375564.66 373841.88 373699.88 376802.91 369988.97 \
376080.45 371580.64 374265.34 376370.34 377033.75 372786.91 375521.61 \
374286.1273 2059.92093


* one-fd-per-request scenario: nginx + wrk
baseline
124509.77 106542.46 118074.46 111434.02 117628.9 108260.56 119037.91 \
114413.75 108706.82 116277.99 111588.4 118216.56 121267.63 110950.4 116455.43 \
114891.004 5168.30288

ep_cc
124209.45 118297.8 120258.1 122785.95 118744.64 118471.76 117621.07 \
117525.14 114797.14 115348.47 117154.02 117208.37 118618.92 118578.8 118945.52 \
118571.01 2437.050182

my
125090.99 121408.92 116444.09 120055.4 119399.24 114938.04 122889.56 \
114363.49 120934.15 119052.03 118060.47 129976.45 121108.08 114849.93 114810.64 \
119558.7653 4338.18401

> Thanks,
> 
> -Jason
> 
> 
>>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Salman Qazi <sqazi@google.com>
>>
>> Acked-by: Davidlohr Bueso <dbueso@suse.de>
>>
>>> ---
>>> fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
>>> 1 file changed, 18 insertions(+), 29 deletions(-)
>>
>> Yay for getting rid of some of the callback hell.
>>
>> Thanks,
>> Davidlohr
> 
> .
> 

Thread overview: 7+ messages
2017-10-13 18:48 [PATCH] epoll: avoid calling ep_call_nested() from ep_poll_safewake() Jason Baron
2017-10-17 15:37 ` Davidlohr Bueso
2017-10-18 14:03   ` Jason Baron
2017-10-28 14:05     ` Davidlohr Bueso
2017-10-31 14:14       ` Jason Baron
2018-01-18 11:00     ` Hou Tao [this message]
2018-01-18 21:18       ` Jason Baron
