From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: Resurrecting EPOLLROUNDROBIN
To: Marek Majkowski, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org
From: Jason Baron
Date: Mon, 25 Mar 2019 12:54:40 -0400
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List:
 linux-fsdevel@vger.kernel.org

On 3/25/19 7:38 AM, Marek Majkowski wrote:
> Hi,
>
> Recently we noticed epoll is not helpful for load balancing when
> called on a listen TCP socket. I described this in a blog post:
>
> https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
>
> The short explanation: new connections going to a listen socket are
> not evenly distributed across processes that wait on the EPOLLIN. In
> practice the last process doing epoll_wait() will get the new
> connection. See the trivial program to reproduce:
>
> https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing/epoll-and-accept.py
>
> $ ./epoll-and-accept.py &
> $ for i in `seq 6`; do echo | nc localhost 1024; done
> worker 0
> worker 0
> worker 0
> worker 0
> worker 0
> worker 0
>
> Worker #0 did all the accept() calls. This is because the listen
> socket wait queue is a LIFO (not FIFO!). With current behaviour, the
> process calling epoll_wait() most recently will be woken up first.
> This usually is the busiest process. This leads to uneven load
> distribution across worker processes.
>
> Notice, described problem is different from what EPOLLEXCLUSIVE tries
> to solve. Exclusive flag is about waking up exactly one process, as
> opposed to default behaviour of waking up all the subscribers
> (thundering herd problem). Without EPOLLEXCLUSIVE the described
> load-balancing problem is less prominent, since there is an inherent
> race when all the woken up processes fight for the new connection. In
> such case the other workers have some chance of getting the new
> connection. The core problem still is there - accept calls are not
> well balanced across waiting processes.
>
> On a loaded server avoiding EPOLLEXCLUSIVE is wasteful. With high
> number of new connections, and dozens of worker processes, waking up
> everybody on every new connection is suboptimal.
>
> Notice, that multiple threads doing blocking accept() have a proper
> FIFO behaviour.
> In other words: you can achieve round-robin load
> balancing by having multiple workers hang on accept(), while you can't
> have that behaviour when waiting in epoll_wait().
>
> We are using EPOLLEXCLUSIVE, and as a solution to load-balancing
> problem we backported the EPOLLROUNDROBIN patch submitted by Jason
> Byron in 2015. We are running this patch for last 6 months, and it
> helped us to flatten the load across workers (and reduce tail
> latency).
>
> https://lists.openwall.net/linux-kernel/2015/02/17/723
>
> (PS. generally speaking EPOLLROUNDROBIN makes no sense in conjunction
> with SO_REUSEPORT sockets)
>
> Jason, would you mind to resubmit it?
>
> Cheers,
> Marek
>

Hi Marek,

So I think there may have been a couple of issues last time. First, I
wasn't convinced that anybody actually wanted this. Sounds like there is
interest now. Second, it touched some of the core wakeup bits, and
although I don't think any of the scheduler maintainers objected, I
didn't want to change the core wakeup code for this epoll-only feature.

I think I can probably register a generic wakeup with the core code and
then have only the epoll code be aware of the round-robin behavior. So I
will cook up a patch like that and re-post.

Thanks,

-Jason
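
For anyone following the thread, here is a rough toy model of the wakeup
ordering described in the quoted report. It is only a sketch and not kernel
code: the function name simulate(), the policy names, and the assumption that
each connection is fully handled before the next one arrives (an otherwise
idle server) are all made up for illustration. The worker numbering is
arbitrary too, so which worker wins under LIFO need not match the reproducer
output above.

```python
from collections import deque


def simulate(policy, workers=3, connections=6):
    """Return the worker id that handles each connection in a toy wait queue.

    Hypothetical model: exactly one waiter is woken per connection, and that
    worker goes back to waiting before the next connection arrives.
    """
    # Workers start waiting in order 0, 1, 2, ..., so the right end of the
    # deque always holds the most recent waiter.
    waiting = deque(range(workers))
    handled = []
    for _ in range(connections):
        if policy == "lifo":
            worker = waiting.pop()       # wake the most recent waiter
        elif policy == "fifo":
            worker = waiting.popleft()   # wake the longest-waiting worker
        else:
            raise ValueError(f"unknown policy: {policy}")
        handled.append(worker)
        # The worker accepts the connection and waits again, which makes it
        # the most recent waiter once more.
        waiting.append(worker)
    return handled


if __name__ == "__main__":
    print("lifo:", simulate("lifo"))
    print("fifo:", simulate("fifo"))
```

Under the LIFO policy the same worker is picked every time, because it
re-enters the queue as the most recent waiter before the next connection
shows up. Under the FIFO policy the workers rotate, which is the round-robin
behaviour the report says blocking accept() already has.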