From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932780Ab1DMBeW (ORCPT ); Tue, 12 Apr 2011 21:34:22 -0400
Received: from smtp-out.google.com ([74.125.121.67]:48325 "EHLO
	smtp-out.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S932750Ab1DMBeV (ORCPT );
	Tue, 12 Apr 2011 21:34:21 -0400
DomainKey-Signature: a=rsa-sha1; c=nofws; d=google.com; s=beta;
	h=date:from:x-x-sender:to:cc:subject:in-reply-to:message-id
	:references:user-agent:mime-version:content-type;
	b=FV4PVgy2qyTHQn7OtoAAFh2/CmHf7XUxYVjQG1kLYI1jyA4aJJjCK6XEwbmhL8EE5X
	Fe+6Ah2XtbHFzsysQmpg==
Date: Tue, 12 Apr 2011 18:34:15 -0700 (PDT)
From: David Rientjes
X-X-Sender: rientjes@chino.kir.corp.google.com
To: Christoph Lameter
cc: Peter Kruse , eric.dumazet@gmail.com, linux-kernel@vger.kernel.org
Subject: Re: I have a blaze of 353 page allocation failures, all alike
In-Reply-To:
Message-ID:
References: <4D53FE43.8030106@q-leap.com> <4D5A2EDB.8060603@q-leap.com>
	<4D5BC16A.2090205@q-leap.com> <4D5BF56F.1000504@q-leap.com>
	<4D5CCEED.3010501@q-leap.com>
	<272bf0cc51439a2ab31ee2f06317dd9f.squirrel@www.q-leap.de>
	<4D6648B5.1090306@q-leap.com> <4DA4692D.7080207@q-leap.de>
User-Agent: Alpine 2.00 (DEB 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-System-Of-Record: true
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 12 Apr 2011, Christoph Lameter wrote:

> > > it took a while to find a date for a reboot... Unfortunately
> > > it was not possible to get the early boot messages with the
> > > kernel 2.6.32.23 since the compiled in log buffer is too
> > > small. So we installed as you suggested a more recent kernel
> > > 2.6.32.29 with a bigger log buffer, I attach the dmesg
> > > of that, and hope that the information in there is useful.
> > > We will keep an eye on that server with the newer kernel
> > > to see if the allocation failures appear again.
> >
> > the server was running for a few without any more allocation
> > failures with kernel 2.6.32.29 but at one point the server
> > stopped responding, it was still possible for a while to
> > get a login, and trying to kill some processes but that
> > didn't succeed. But after that even login was
> > no longer possible so we had to reset it.
> > I attach the call trace, I hope you can find out what is
> > the problem.
>
> The problem may be that you have lots and lots of SCSI devices which
> consume ZONE_DMA memory for their control structures. I guess that is
> oversubscribing the 16M zone.
>

You can try to get more memory reserves specifically for lowmem in
ZONE_DMA by changing /proc/sys/vm/lowmem_reserve_ratio. The values are
ratios, so lowering the numbers will yield larger amounts of memory
reserves in ZONE_DMA for GFP_DMA allocations. Try lowering the non-zero
entries to 1 to reserve the entire zone for lowmem, assuming your system
has enough RAM for everything else you're running.

This will verify whether ZONE_DMA is being depleted by the large number
of SCSI devices. If you don't get any additional page allocation
failures, then check how much memory in ZONE_DMA is used at peak and use
that to pick a sane reserve ratio for the next time you restart the
system.
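
A rough sketch of the two steps above, in case it helps: the first
function rewrites the contents of /proc/sys/vm/lowmem_reserve_ratio with
every non-zero entry lowered to 1, and the second pulls the "pages free"
count for each ZONE_DMA out of /proc/zoneinfo text so you can watch how
low it gets. The function names are made up for illustration, and the
parsing assumes the usual "Node N, zone NAME" / "pages free N" layout of
/proc/zoneinfo; check it against what your kernel actually prints.

```python
import re


def lower_reserve_ratios(ratio_text: str) -> str:
    """Return the ratio line with every non-zero entry replaced by 1.

    ratio_text is the content of /proc/sys/vm/lowmem_reserve_ratio,
    i.e. whitespace-separated integers. Zero entries are left alone.
    """
    values = [int(v) for v in ratio_text.split()]
    return " ".join("1" if v != 0 else "0" for v in values)


def dma_zone_free_pages(zoneinfo_text: str) -> dict:
    """Map node id -> 'pages free' for every zone named exactly DMA.

    zoneinfo_text is the content of /proc/zoneinfo. Zones such as
    DMA32 or Normal are skipped (assumed layout, see note above).
    """
    free = {}
    node = None
    in_dma = False
    for line in zoneinfo_text.splitlines():
        header = re.match(r"Node (\d+), zone\s+(\S+)", line)
        if header:
            node = int(header.group(1))
            in_dma = header.group(2) == "DMA"
            continue
        if in_dma:
            pages = re.match(r"\s*pages free\s+(\d+)", line)
            if pages:
                free[node] = int(pages.group(1))
    return free
```

To use it, read the current ratio file, pass its contents through
lower_reserve_ratios(), and write the result back as root. Then call
dma_zone_free_pages() on /proc/zoneinfo periodically; the lowest free
count you see corresponds to peak ZONE_DMA usage.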