Subject: Re: [PATCH v7 7/6] fs/epoll: scale nested callbacks
From: Jason Baron
To: Davidlohr Bueso
Cc: Waiman Long, akpm@linux-foundation.org, Alexander Viro, Jan Kara,
 Jeff Layton, "J. Bruce Fields", Tejun Heo, Christoph Lameter,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 Ingo Molnar, Peter Zijlstra, Andi Kleen, Dave Chinner, Boqun Feng
Date: Wed, 18 Oct 2017 10:06:52 -0400
Message-ID: <77a2bec3-f663-40de-9ad8-949bf3230a76@akamai.com>
In-Reply-To: <20171017155318.5rcoprfwxs27k2qq@linux-n805>
References: <1507229008-20569-1-git-send-email-longman@redhat.com>
 <20171013154543.GI5131@linux-80c1.suse>
 <5c15251d-0034-dc2a-5019-3e8f0d60c194@akamai.com>
 <20171017155318.5rcoprfwxs27k2qq@linux-n805>

On 10/17/2017 11:53 AM, Davidlohr Bueso wrote:
> On Mon, 16 Oct 2017, Jason Baron wrote:
>
>> Hi,
>>
>> I posted a patch to completely remove the lock here from the
>> ep_poll_safewake() path here:
>>
>> http://lkml.iu.edu/hypermail/linux/kernel/1710.1/05834.html
>
> Kernel development never ceases to amaze me. Two complementary
> fixes to an 8+ year-old performance issue on the same day - not that
> nested epolls are that common, but it also comes from two real
> workloads...
>
> Getting rid of the lock altogether is always the best way.
>
>> So these are going to conflict. The reason it's safe to remove the
>> lock is that there are loop and depth checks now that are performed
>> during EPOLL_CTL_ADD. Specifically, ep_loop_check(). I would prefer
>> to do these checks once at add time, as opposed to at each wakeup
>> (even if they can be scaled better).
>
> Wrt conflicts, no worries, I'll rebase -- and hopefully we can get
> the dlock stuff in for v4.15 as well.
>
>> I also have a pending patch to do something similar for
>> poll_readywalk_ncalls, but I think that poll_safewake_ncalls is the
>> most egregious one here?
>
> The customer's workload issues are with the loop_ncalls and
> readywalk_ncalls lists, so I'd be interested in seeing your patch for
> the latter. The reason your patch above is likely not to help my
> scenario is that most of the time is spent at a dispatcher level doing
> epoll_wait, not too many EPOLL_CTL_ADDs going on afaict.

If there are not many EPOLL_CTL_ADDs, then I wouldn't think loop_ncalls
would be highly contended (since it should only be used from the add
path)?
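To make the add-time vs. wakeup-time distinction concrete, here is a
minimal userspace sketch of a nested epoll (the fd names are made up
for illustration): ep_loop_check() runs once, inside the EPOLL_CTL_ADD
below, while the nested wakeup callbacks run on every event that
propagates from the inner instance to the outer one:

#include <sys/epoll.h>
#include <stdio.h>

int main(void)
{
	struct epoll_event ev = { .events = EPOLLIN };
	struct epoll_event out;
	int outer = epoll_create1(0);	/* top-level dispatcher */
	int inner = epoll_create1(0);	/* nested epoll instance */

	/*
	 * Adding one epoll fd to another is what makes it "nested".
	 * The kernel walks the graph here, at add time, to reject
	 * cycles and excessive depth before accepting the fd.
	 */
	ev.data.fd = inner;
	if (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) < 0) {
		perror("EPOLL_CTL_ADD");
		return 1;
	}

	/* The hot path: wakeups propagate from inner to outer here. */
	epoll_wait(outer, &out, 1, 0);
	return 0;
}

So a dispatcher that mostly sits in epoll_wait() pays the nested
callback cost on every wakeup, but only pays for the loop checks when
the mesh is (re)built.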
Thanks,

-Jason

> Thanks,
> Davidlohr
>
>> Thanks,
>>
>> -Jason
>>
>> On 10/13/2017 11:45 AM, Davidlohr Bueso wrote:
>>> A customer reported massive contention on the ncalls->lock in a
>>> workload designed around nested epolls (where the fd being polled is
>>> itself an epoll).
>>>
>>> 83.49%  [kernel]   [k] __pv_queued_spin_lock_slowpath
>>>  2.70%  [kernel]   [k] ep_call_nested.constprop.13
>>>  1.94%  [kernel]   [k] _raw_spin_lock_irqsave
>>>  1.83%  [kernel]   [k] __raw_callee_save___pv_queued_spin_unlock
>>>  1.45%  [kernel]   [k] _raw_spin_unlock_irqrestore
>>>  0.41%  [kernel]   [k] ep_scan_ready_list.isra.8
>>>  0.36%  [kernel]   [k] pvclock_clocksource_read
>>>
>>> The application, running on kvm, uses shared-memory IPC with a pipe
>>> wakeup mechanism, and a two-level dispatcher built around
>>> 'epoll_wait'. There is one process per physical core and a full mesh
>>> of pipes between them, so on an 18-core system (the actual case)
>>> there are 18*18 pipes for the IPCs alone.
>>>
>>> This patch proposes replacing the nested calls' global linked list
>>> with a dlock interface, with which we can benefit from pcpu lists
>>> when doing ep_poll_safewake(), and from hashing on the current task
>>> otherwise. This provides two benefits:
>>>
>>> 1. It speeds up the loop and max-depth checks from O(N) lookups to
>>>    O(1) (modulo possible collisions, which we still have to iterate
>>>    over); and,
>>>
>>> 2. It provides finer lock granularity.
>>>
>>> cpus     before      after       diff
>>>    1    1409370    1344804     -4.58%
>>>    2    1015690    1337053     31.63%
>>>    3     721009    1273254     76.59%
>>>    4     380959    1128931    196.33%
>>>    5     287603    1028362    257.56%
>>>    6     221104     894943    304.76%
>>>    7     173592     976395    462.46%
>>>    8     145860     922584    532.51%
>>>    9     127877     925900    624.05%
>>>   10     112603     791456    602.87%
>>>   11      97926     724296    639.63%
>>>   12      80732     730485    804.82%
>>>
>>> With the exception of the single-cpu case (where the overhead of
>>> jhashing shows), we get a pretty good raw throughput increase.
>>> Similarly, the amount of time spent on the lock decreases immensely
>>> as well.
>>>
>>> Passes the related ltp testcases.
>>>
>>> Signed-off-by: Davidlohr Bueso
>>> ---
>>> fs/eventpoll.c | 88 +++++++++++++++++++++++++++++++++++-----------------------
>>> 1 file changed, 53 insertions(+), 35 deletions(-)
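A note for readers before the hunks: the dlock-list API used below --
alloc_dlock_list_heads(), dlock_list_hash(), dlock_lists_del() -- comes
from Waiman Long's dlock-list series that this patch depends on; it is
not in mainline, so take the following as a sketch of that proposed
interface, distilled from the hunks themselves:

	struct dlock_list_heads heads;	/* one spinlock-protected head per cpu */
	struct dlock_list_head *head;
	unsigned long flags;

	/* setup: percpu heads instead of one global list + lock */
	if (alloc_dlock_list_heads(&heads, true))
		return -ENOMEM;

	/* insert: hash the key to pick a head, lock only that head */
	head = dlock_list_hash(&heads, ctx);
	spin_lock_irqsave(&head->lock, flags);
	list_add(&node.llink.list, &head->list);
	node.llink.head = head;	/* the node remembers its head */
	spin_unlock_irqrestore(&head->lock, flags);

	/* delete: the node knows its head, no global lock is taken */
	dlock_lists_del(&node.llink);

Contention then spreads across the per-head locks instead of piling up
on a single ncalls->lock.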
>>> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
>>> index 2fabd19cdeea..675c97fdc5da 100644
>>> --- a/fs/eventpoll.c
>>> +++ b/fs/eventpoll.c
>>> @@ -22,7 +22,7 @@
>>>  #include <linux/slab.h>
>>>  #include <linux/poll.h>
>>>  #include <linux/string.h>
>>> -#include <linux/list.h>
>>> +#include <linux/dlock-list.h>
>>>  #include <linux/hash.h>
>>>  #include <linux/spinlock.h>
>>>  #include <linux/syscalls.h>
>>> @@ -119,7 +119,7 @@ struct epoll_filefd {
>>>   * and loop cycles.
>>>   */
>>>  struct nested_call_node {
>>> -	struct list_head llink;
>>> +	struct dlock_list_node llink;
>>>  	void *cookie;
>>>  	void *ctx;
>>>  };
>>> @@ -129,8 +129,7 @@ struct nested_call_node {
>>>   * maximum recursion dept and loop cycles.
>>>   */
>>>  struct nested_calls {
>>> -	struct list_head tasks_call_list;
>>> -	spinlock_t lock;
>>> +	struct dlock_list_heads tasks_call_list;
>>>  };
>>>
>>>  /*
>>> @@ -371,10 +370,14 @@ static inline int ep_op_has_event(int op)
>>>  }
>>>
>>>  /* Initialize the poll safe wake up structure */
>>> -static void ep_nested_calls_init(struct nested_calls *ncalls)
>>> +static inline int ep_nested_calls_init(struct nested_calls *ncalls)
>>> +{
>>> +	return alloc_dlock_list_heads(&ncalls->tasks_call_list, true);
>>> +}
>>> +
>>> +static inline void ep_nested_calls_free(struct nested_calls *ncalls)
>>>  {
>>> -	INIT_LIST_HEAD(&ncalls->tasks_call_list);
>>> -	spin_lock_init(&ncalls->lock);
>>> +	free_dlock_list_heads(&ncalls->tasks_call_list);
>>>  }
>>>
>>>  /**
>>> @@ -483,45 +486,50 @@ static int ep_call_nested(struct nested_calls *ncalls, int max_nests,
>>>  {
>>>  	int error, call_nests = 0;
>>>  	unsigned long flags;
>>> -	struct list_head *lsthead = &ncalls->tasks_call_list;
>>> -	struct nested_call_node *tncur;
>>> -	struct nested_call_node tnode;
>>> +	struct dlock_list_head *head;
>>> +	struct nested_call_node *tncur, tnode;
>>>
>>> -	spin_lock_irqsave(&ncalls->lock, flags);
>>> +	head = dlock_list_hash(&ncalls->tasks_call_list, ctx);
>>>
>>> +	spin_lock_irqsave(&head->lock, flags);
>>>  	/*
>>>  	 * Try to see if the current task is already inside this wakeup call.
>>> -	 * We use a list here, since the population inside this set is always
>>> -	 * very much limited.
>>> +	 *
>>> +	 * We use a chained hash table with linked lists here, and take the
>>> +	 * lock to avoid racing on collisions (the ctx pointer is the key).
>>> +	 * Calls for which the context is the cpu number avoid hashing and
>>> +	 * act as pcpu add/removal.
>>> +	 *
>>> +	 * Since the population inside this set is always very much limited,
>>> +	 * list scanning should be short.
>>>  	 */
>>> -	list_for_each_entry(tncur, lsthead, llink) {
>>> -		if (tncur->ctx == ctx &&
>>> -		    (tncur->cookie == cookie || ++call_nests > max_nests)) {
>>> -			/*
>>> -			 * Ops ... loop detected or maximum nest level reached.
>>> -			 * We abort this wake by breaking the cycle itself.
>>> -			 */
>>> -			error = -1;
>>> -			goto out_unlock;
>>> -		}
>>> -	}
>>> +	list_for_each_entry(tncur, &head->list, llink.list) {
>>> +		if (tncur->ctx == ctx &&
>>> +		    (tncur->cookie == cookie || ++call_nests > EP_MAX_NESTS)) {
>>> +			/*
>>> +			 * Ops ... loop detected or maximum nest level reached.
>>> +			 * We abort this wake by breaking the cycle itself.
>>> +			 */
>>> +			error = -1;
>>> +			spin_unlock_irqrestore(&head->lock, flags);
>>> +			goto done;
>>> +		}
>>> +	}
>>> +
>>>
>>>  	/* Add the current task and cookie to the list */
>>>  	tnode.ctx = ctx;
>>>  	tnode.cookie = cookie;
>>> -	list_add(&tnode.llink, lsthead);
>>> -
>>> -	spin_unlock_irqrestore(&ncalls->lock, flags);
>>> +	tnode.llink.head = head;
>>> +	list_add(&tnode.llink.list, &head->list);
>>> +	spin_unlock_irqrestore(&head->lock, flags);
>>>
>>>  	/* Call the nested function */
>>>  	error = (*nproc)(priv, cookie, call_nests);
>>>
>>>  	/* Remove the current task from the list */
>>> -	spin_lock_irqsave(&ncalls->lock, flags);
>>> -	list_del(&tnode.llink);
>>> -out_unlock:
>>> -	spin_unlock_irqrestore(&ncalls->lock, flags);
>>> -
>>> +	dlock_lists_del(&tnode.llink);
>>> +done:
>>>  	return error;
>>>  }
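One thing worth noting for readers of the hunk above: the call site
that passes a cpu number as ctx (and therefore maps to a stable
per-cpu head) is ep_poll_safewake(). In mainline it looks roughly like
this (unchanged by this patch):

	static void ep_poll_safewake(wait_queue_head_t *wq)
	{
		int this_cpu = get_cpu();

		/* ctx is the cpu number, so the hash is effectively percpu */
		ep_call_nested(&poll_safewake_ncalls, EP_MAX_NESTS,
			       ep_poll_wakeup_proc, NULL, wq,
			       (void *) (long) this_cpu);

		put_cpu();
	}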
>>> @@ -2313,13 +2321,16 @@ static int __init eventpoll_init(void)
>>>  	 * Initialize the structure used to perform epoll file descriptor
>>>  	 * inclusion loops checks.
>>>  	 */
>>> -	ep_nested_calls_init(&poll_loop_ncalls);
>>> +	if (ep_nested_calls_init(&poll_loop_ncalls))
>>> +		goto err;
>>>
>>>  	/* Initialize the structure used to perform safe poll wait head wake ups */
>>> -	ep_nested_calls_init(&poll_safewake_ncalls);
>>> +	if (ep_nested_calls_init(&poll_safewake_ncalls))
>>> +		goto err_free1;
>>>
>>>  	/* Initialize the structure used to perform file's f_op->poll() calls */
>>> -	ep_nested_calls_init(&poll_readywalk_ncalls);
>>> +	if (ep_nested_calls_init(&poll_readywalk_ncalls))
>>> +		goto err_free0;
>>>
>>>  	/*
>>>  	 * We can have many thousands of epitems, so prevent this from
>>> @@ -2336,5 +2347,12 @@ static int __init eventpoll_init(void)
>>>  			sizeof(struct eppoll_entry), 0, SLAB_PANIC, NULL);
>>>
>>>  	return 0;
>>> +
>>> +err_free0:
>>> +	ep_nested_calls_free(&poll_safewake_ncalls);
>>> +err_free1:
>>> +	ep_nested_calls_free(&poll_loop_ncalls);
>>> +err:
>>> +	return -ENOMEM;
>>>  }
>>>  fs_initcall(eventpoll_init);
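A closing note on that last hunk: the new error unwinding in
eventpoll_init() follows the usual kernel goto-ladder, where each later
failure jumps to a label that tears down everything initialized before
it, in reverse order (note err_free0 falls through into err_free1). The
shape of the pattern, with generic names for illustration:

	static int __init subsys_init(void)
	{
		if (init_a())		/* nothing to undo yet */
			goto err;
		if (init_b())		/* must undo a */
			goto err_free_a;
		if (init_c())		/* must undo b, then a */
			goto err_free_b;

		return 0;

	err_free_b:
		free_b();		/* falls through: also frees a */
	err_free_a:
		free_a();
	err:
		return -ENOMEM;
	}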