From: Marek Majkowski
Date: Mon, 25 Mar 2019 12:38:24 +0100
Subject: Resurrecting EPOLLROUNDROBIN
To: linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, jbaron@akamai.com

Hi,

Recently we noticed that epoll is not helpful for load balancing when
called on a listening TCP socket. I described this in a blog post:

https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/

The short explanation: new connections arriving on a listening socket
are not evenly distributed across the processes waiting on EPOLLIN. In
practice, the last process to call epoll_wait() will get the new
connection.
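To make the setup concrete, here is a rough C sketch of the pattern in
question: several forked workers, each with its own epoll instance, all
watching the same listening socket with plain EPOLLIN. This is only an
illustrative sketch, not the reproduction program referenced below; the
port (1024), the worker count and the lack of error handling are
arbitrary choices of the sketch.

#define _GNU_SOURCE             /* for SOCK_NONBLOCK */
#include <netinet/in.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	/* Shared listening socket; non-blocking, so a worker that loses
	 * the race for a connection gets EAGAIN instead of blocking. */
	int lfd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
	int one = 1;
	setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	struct sockaddr_in addr = {0};
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = htons(1024);
	bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
	listen(lfd, 16);

	for (int w = 0; w < 4; w++) {
		if (fork() == 0) {
			/* Each worker has its own epoll instance but
			 * watches the same listening socket. */
			int efd = epoll_create1(0);
			struct epoll_event ev = {
				.events = EPOLLIN,
				.data.fd = lfd,
			};
			epoll_ctl(efd, EPOLL_CTL_ADD, lfd, &ev);

			for (;;) {
				struct epoll_event out;
				if (epoll_wait(efd, &out, 1, -1) < 1)
					continue;
				int cfd = accept(lfd, NULL, NULL);
				if (cfd < 0)
					continue; /* another worker won the race */
				printf("worker %d\n", w);
				close(cfd);
			}
		}
	}
	pause(); /* parent idles; workers run until killed */
	return 0;
}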
See the trivial program to reproduce:

https://github.com/cloudflare/cloudflare-blog/blob/master/2017-10-accept-balancing/epoll-and-accept.py

$ ./epoll-and-accept.py &
$ for i in `seq 6`; do echo | nc localhost 1024; done
worker 0
worker 0
worker 0
worker 0
worker 0
worker 0

Worker #0 did all the accept() calls. This is because the listening
socket's wait queue is a LIFO (not a FIFO!): with the current
behaviour, the process that called epoll_wait() most recently is woken
up first, and that is usually the busiest process. This leads to
uneven load distribution across worker processes.

Notice that the problem described here is different from the one
EPOLLEXCLUSIVE tries to solve. The exclusive flag is about waking up
exactly one process, as opposed to the default behaviour of waking up
all subscribers (the thundering herd problem). Without EPOLLEXCLUSIVE
the load-balancing problem is less prominent, since there is an
inherent race when all the woken-up processes fight for the new
connection, so the other workers have some chance of winning it. But
the core problem is still there: accept() calls are not well balanced
across waiting processes. And on a loaded server avoiding
EPOLLEXCLUSIVE is wasteful: with a high rate of new connections and
dozens of worker processes, waking up everybody on every new
connection is suboptimal.

Notice also that multiple threads doing blocking accept() do get
proper FIFO behaviour. In other words, you can achieve round-robin
load balancing by having multiple workers hang in accept(), but you
can't get that behaviour when waiting in epoll_wait().

We are using EPOLLEXCLUSIVE, and as a solution to the load-balancing
problem we backported the EPOLLROUNDROBIN patch submitted by Jason
Baron in 2015. We have been running this patch for the last 6 months,
and it has helped us flatten the load across workers (and reduce tail
latency).

https://lists.openwall.net/linux-kernel/2015/02/17/723

(PS. Generally speaking, EPOLLROUNDROBIN makes no sense in conjunction
with SO_REUSEPORT sockets.)

Jason, would you mind resubmitting it?

Cheers,
Marek
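For anyone who wants to try the combination discussed above,
registration looks roughly like the sketch below once the patch is
applied. EPOLLROUNDROBIN is not in the mainline uapi headers, so the
value here is an assumption taken from the referenced 2015 patch and
has to match whatever your backport defines; watch_listen_fd() is just
an illustrative helper name.

#include <sys/epoll.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE  (1u << 28) /* in mainline since 4.5; for older headers */
#endif
#ifndef EPOLLROUNDROBIN
#define EPOLLROUNDROBIN (1u << 27) /* assumed value, per the out-of-tree patch */
#endif

/* Register a listening socket so that only one waiter is woken per
 * event (EPOLLEXCLUSIVE) and wake-ups rotate between the waiters
 * (EPOLLROUNDROBIN, with the patch applied). */
int watch_listen_fd(int epfd, int listen_fd)
{
	struct epoll_event ev = {
		.events  = EPOLLIN | EPOLLEXCLUSIVE | EPOLLROUNDROBIN,
		.data.fd = listen_fd,
	};
	return epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
}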