From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1752622AbcHKPVd (ORCPT ); Thu, 11 Aug 2016 11:21:33 -0400
Received: from mail-wm0-f68.google.com ([74.125.82.68]:34637 "EHLO
 mail-wm0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
 with ESMTP id S1751257AbcHKPV3 (ORCPT ); Thu, 11 Aug 2016 11:21:29 -0400
MIME-Version: 1.0
In-Reply-To: <20160704094342.108621834@linutronix.de>
References: <20160704093956.299369787@linutronix.de>
 <20160704094342.108621834@linutronix.de>
From: Jouni Malinen
Date: Thu, 11 Aug 2016 18:21:26 +0300
Message-ID:
Subject: Re: [patch 4 14/22] timer: Switch to a non cascading wheel
To: Thomas Gleixner
Cc: LKML , Ingo Molnar , Peter Zijlstra , Paul McKenney ,
 Frederic Weisbecker , Chris Mason , Arjan van de Ven ,
 rt@linutronix.de, Rik van Riel , George Spelvin , Len Brown ,
 Josh Triplett , Eric Dumazet
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Jul 4, 2016 at 12:50 PM, Thomas Gleixner wrote:
> The current timer wheel has some drawbacks:
...

It looks like this change (commit
500462a9de657f86edaa102f8ab6bff7f7e43fc2 in linux.git) breaks one of
the automated test cases I'm using to test hostapd and wpa_supplicant
with mac80211_hwsim from the kernel. I'm not sure what exactly causes
this (I did not really expect git bisect to point to timers..), but it
is very reproducible for me under kvm. It apparently did not happen on
another device, though, so I'm not completely sure what is needed to
reproduce it. The ap_wps_er_http_proto test cases fail to connect 20
TCP stream sockets to a server on the localhost. The client side is a
python test script and the server is hostapd. The failure shows up
with about the 13th of those socket connects failing, while all the
others (both before and after the failed one) go through.

Would you happen to have any idea why this commit causes such a
difference in behavior? I'm currently working around this in my test
script with the following change, but it might be worthwhile to
confirm whether there is something in the kernel change that resulted
in unexpected behavior.

http://w1.fi/cgit/hostap/commit/?id=2d6a526ac3885605f34df4037fc79ad330565b23

The test code looked like this in python:

    addr = (url.hostname, url.port)
    socks = {}
    for i in range(20):
        socks[i] = socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                                 socket.IPPROTO_TCP)
        socks[i].connect(addr)

The connect() call is the failing operation (it times out), and this
seemed to happen for i == 13 most of the time. This shows up only with
commit 500462a9de657f86edaa102f8ab6bff7f7e43fc2 included in the kernel
(i.e., a test with commit b0d6e2dcb284f1f4dcb4b92760f49eeaf5fc0bc7 as
the kernel snapshot does not show this behavior).

Changes in 500462a9 were not trivial to revert on top of the current
master, so I have not checked whether the current master branch would
get rid of the failure if only this one commit were reverted.

I can reproduce this easily, so if someone wants to get more details
of the issue, just let me know how to collect whatever would be
useful.

- Jouni
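
For anyone who wants to try the same connect pattern outside the
hostapd test setup, here is a minimal standalone sketch. It assumes a
plain listening socket on 127.0.0.1 standing in for hostapd. The names
connect_many and run_demo, the 5 second timeout, and the backlog of 32
are made up for illustration and are not from the actual test script.
It records whether each connect() call succeeded and how long it took,
which might help show which iteration stalls. Whether it triggers the
same failure has not been checked.

```python
import socket
import time


def connect_many(addr, count=20, timeout=5.0):
    """Open `count` TCP connections to `addr`, one after another.

    Returns (socks, results), where results[i] is a tuple
    (ok, seconds) describing the i-th connect() call.
    The timeout value is an assumption, not from the original test.
    """
    socks = []
    results = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM,
                          socket.IPPROTO_TCP)
        s.settimeout(timeout)
        start = time.monotonic()
        try:
            s.connect(addr)
            ok = True
        except OSError:
            # Covers timeouts and refused connections alike.
            ok = False
        results.append((ok, time.monotonic() - start))
        socks.append(s)
    return socks, results


def run_demo():
    """Run connect_many() against a hypothetical stand-in server."""
    # Stand-in for hostapd: a listening socket that never calls
    # accept(). The backlog of 32 is an assumption chosen to be
    # larger than the 20 pending connections.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(32)
    addr = server.getsockname()
    try:
        socks, results = connect_many(addr)
        for s in socks:
            s.close()
    finally:
        server.close()
    return results


if __name__ == "__main__":
    for i, (ok, seconds) in enumerate(run_demo()):
        print(i, "ok" if ok else "FAILED", "%.3f s" % seconds)
```

Keeping the per-connect timing next to the success flag makes a single
slow or timed out connect stand out from the other iterations.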