Subject: Re: Resurrecting EPOLLROUNDROBIN
From: Jason Baron
To: Andy Lutomirski, Marek Majkowski
Cc: Linux FS Devel, Linux API
Date: Tue, 26 Mar 2019 11:00:27 -0400
Message-ID: <7396178d-bc21-3dac-5eea-4990ae8ec112@akamai.com>
X-Mailing-List: linux-fsdevel@vger.kernel.org

On 3/25/19 8:23 PM, Andy Lutomirski wrote:
> On Mon, Mar 25, 2019 at 4:38 AM Marek Majkowski wrote:
>>
>> Hi,
>>
>> Recently we noticed epoll is not helpful for load balancing when
>> called on a listen TCP socket. I described this in a blog post:
>>
>> https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/
>>
>> The short explanation: new connections going to a listen socket are
>> not evenly distributed across processes that wait on the EPOLLIN. In
>> practice the last process doing epoll_wait() will get the new
>> connection.
>> See the trivial program to reproduce:
>>
>> https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing/epoll-and-accept.py
>>
>> $ ./epoll-and-accept.py &
>> $ for i in `seq 6`; do echo | nc localhost 1024; done
>> worker 0
>> worker 0
>> worker 0
>> worker 0
>> worker 0
>> worker 0
>>
>> Worker #0 did all the accept() calls. This is because the listen
>> socket wait queue is a LIFO (not FIFO!). With current behaviour, the
>> process calling epoll_wait() most recently will be woken up first.
>> This usually is the busiest process. This leads to uneven load
>> distribution across worker processes.
>
> I recall a discussion of this at a conference several years ago, but
> it's been several years. Anyway:
>
> I read the blog post, and I looked at your example, and the kernel
> behavior actually seems quite sane to me. From the kernel's
> perspective, if you're calling accept in a loop in a bunch of threads
> (mediated by epoll or otherwise), and one of those threads is able to
> call accept() fast enough, then that thread *should* get all the
> sockets. It's cache hot, and bouncing around is expensive.

Yes, the EPOLLEXCLUSIVE flag was what we ended up with after the last
set of discussions on this. It's meant as a sort of sane wakeup
behavior for when you have one event source fd that is attached to
multiple epoll fds, or epfds. Without the EPOLLEXCLUSIVE flag, you end
up with all of the epfds getting woken up.

>
> Now obviously the overall behavior here is suboptimal, but that's
> arguably because the user process is being silly, not because the
> kernel is doing it wrong. Shouldn't the user process take the newly
> accepted socket and hand it off to an appropriate thread for
> servicing? If I were doing this, I'd get a freshly accepted socket
> and either forward it to a thread (or process) that is appropriately
> lightly loaded or, even better, that is pinned to the CPU that RFS has
> assigned to the flow assuming that that thread isn't overloaded.
> If the program is using threads, then this doesn't need to involve the
> kernel at all and, if it's using processes, then SCM_RIGHTS would do
> the trick. But asking the kernel to arbitrarily and awkwardly
> round-robin the sockets and then keeping the flows on the threads that
> get picked means that, at best, each thread gets an arbitrary
> selection of flows and the balancing isn't particularly good.
>
> Now, if someone were to actually try doing this in userspace and it
> was too slow, I could see adding some kernel mechanisms to accelerate
> the process. Perhaps a mechanism to ask to accept only new
> connections that are RFSified to the calling CPU would be useful. But
> this shouldn't be an *epoll* mechanism, since there is no actual
> guarantee that the CPU that returns first from epoll_wait() is the
> same CPU that calls accept() under load. (Under load, multiple new
> connections could come in and wake multiple CPUs before any of them
> manage to call accept().)
>
> So I think that EPOLLROUNDROBIN is not a great solution to the
> problem, and I think that the problem isn't obviously a *kernel*
> problem in the first place.
>

Another point in this direction: yes, you can try to 'balance' things
at accept() time, but over time things can get very unbalanced.
Long-lived connections, for example, could end up on some CPUs and not
others. So it seems like some sort of periodic load balancing would be
necessary anyway.

That said, I'm not against some basic wakeup distribution strategies in
the kernel, such as round robin, especially if they can be entirely
contained in the epoll layer (which I think we can do). But clearly we
don't want to introduce a new wakeup distribution strategy for every
use case.

Thanks,

-Jason
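
As a rough illustration of the wakeup-order point discussed in this
thread, here is a toy Python model. It is not kernel code and not the
epoll implementation. It assumes that exactly one waiter is woken per
new connection and that the woken worker finishes accept() and goes
back to waiting before the next connection arrives, as in the nc loop
quoted above. The function name simulate() and the policy labels "lifo"
and "rr" are made up for this sketch.

```python
from collections import Counter, deque


def simulate(policy, n_workers=3, n_connections=6):
    """Toy model of a wait queue that wakes one idle worker per connection.

    policy="lifo": the woken worker re-queues at the head, so it is
                   picked again for the next connection.
    policy="rr":   the woken worker re-queues at the tail, which gives
                   a round-robin distribution.
    Both policies are simplifying assumptions, not kernel behaviour.
    """
    waiters = deque(range(n_workers))  # the head of the deque is woken first
    handled = Counter()
    for _ in range(n_connections):
        worker = waiters.popleft()     # wake exactly one waiter
        handled[worker] += 1           # it accepts, then waits again
        if policy == "lifo":
            waiters.appendleft(worker)
        elif policy == "rr":
            waiters.append(worker)
        else:
            raise ValueError(f"unknown policy: {policy!r}")
    return dict(handled)


print(simulate("lifo"))  # {0: 6}
print(simulate("rr"))    # {0: 2, 1: 2, 2: 2}
```

In this model the "lifo" case reproduces the "worker 0" pattern from the
quoted output, while "rr" spreads the six connections evenly. It says
nothing about long-lived connections drifting out of balance later,
which is the concern raised above about balancing only at accept() time.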