From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S261250AbVABMEm (ORCPT ); Sun, 2 Jan 2005 07:04:42 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S261251AbVABMEm (ORCPT ); Sun, 2 Jan 2005 07:04:42 -0500 Received: from one.firstfloor.org ([213.235.205.2]:45493 "EHLO one.firstfloor.org") by vger.kernel.org with ESMTP id S261250AbVABMEk (ORCPT ); Sun, 2 Jan 2005 07:04:40 -0500 To: Nick Piggin Cc: linux-kernel@vger.kernel.org, Arjan van de Ven Subject: Re: 2.5isms References: <20041231230624.GA29411@andromeda> <41D60C35.9000503@yahoo.com.au> <41D743BE.3060207@yahoo.com.au> From: Andi Kleen Date: Sun, 02 Jan 2005 13:04:39 +0100 In-Reply-To: <41D743BE.3060207@yahoo.com.au> (Nick Piggin's message of "Sun, 02 Jan 2005 11:43:42 +1100") Message-ID: User-Agent: Gnus/5.110002 (No Gnus v0.2) Emacs/21.3 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Nick Piggin writes: > > I'm curious about a couple of points though. First, is that it is basically > just adding a cache colouring to the stack, right? In that case why do only > older HT CPUs have bad performance without it? And wouldn't it possibly make Intel improved the HT implementation over time (there are at least three generations) In particular the latest "Prescott" CPUs lost a lot of problems earlier versions have. As far as I know they improved the caches to make the cache thrashing problem less severe. > even non HT CPUs possibly slightly more efficient WRT caching the stacks of > multiple processes? Not on x86 no because they normally have physically indexed caches (except for L1, but that is not really preserved over a context switch) HT is just a special case because two threads essentially share cache. In theory it could help on non x86 CPUs with virtually indexed caches, but it is doubtful if they don't need more advanced forms of cache colouring. > Second, on what workloads does performance suffer, can you remember? I wonder > if natural variations in the stack pointer as the program runs would mitigate > the effect of this on all but micro benchmarks? iirc on lots of different workloas that run code on both virtual CPUs at the same time. Without it you would get L1 cache thrashing, which can slow things down quite a lot. And yes it made a real difference. The P4 cache have some pecularities ("64K aliasing") that made the problem worse. -Andi