From: Karl Vogel
To: Jens Axboe
Cc: Marcelo Tosatti, linux-kernel@vger.kernel.org, Ingo Molnar
Subject: RE: Kernel 2.6.8.1: swap storm of death - CFQ scheduler=culprit
Date: Tue, 24 Aug 2004 12:28:13 +0200
Message-ID: <6DED3619289CD311BCEB00508B8E133601A68B1E@nt-server2.antwerp.seagha.com>

> > > Original post with testcase + stats:
> > > http://article.gmane.org/gmane.linux.kernel/228156
> >
> > 2.6.8.1-mm4 clean does not reproduce the problem. Marcelo, your
> > 2.6.8-rc4 report is not valid due to the fixed problem related to that
> > in CFQ already. I'd still like for you to retest with 2.6.8.1.
> >

Did some extra testing yesterday. When not running X or anything substantial, I'm able to trigger it after running the expunge 2 or 3 times in a row. If I increase the calloc size, it triggers faster (tried with a 1 GB calloc on a 512 MB box with a 1 GB swap partition). The first expunge run completes fine; the runs after that get OOM-killed, and I get a printk about an order-0 page allocation failure.

The 2.6.8.1-mm4 was a clean version, but I will double-check that this evening.

I also tried with the deadline scheduler, but was unable to trigger it there.

> Oh, and please do also do a sysrq-t from a hung box and save
> the output.

Note: the box doesn't hang completely. Some processes just get stuck in 'D' state and the machine swaps heavily.
The tests from yesterday evening did recover, so I'm guessing that if I had waited long enough, the box would have recovered in the previous tests too. The vmstat output from my previous tests shows that the box was low on memory (free, buff and cache are all very low):

http://users.telenet.be/kvogel/vmstat-after-kill.txt

That was probably why it was swapping like mad. Will provide you with that sysrq-t this evening.

Karl.