From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933225AbcI3JtS (ORCPT ); Fri, 30 Sep 2016 05:49:18 -0400 Received: from mail-wm0-f68.google.com ([74.125.82.68]:33400 "EHLO mail-wm0-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932088AbcI3JtJ (ORCPT ); Fri, 30 Sep 2016 05:49:09 -0400 Date: Fri, 30 Sep 2016 11:49:05 +0200 From: Michal Hocko To: Rik van Riel Cc: Minchan Kim , Sonny Rao , Jann Horn , Robert Foss , corbet@lwn.net, Andrew Morton , Vlastimil Babka , Konstantin Khlebnikov , Hugh Dickins , Naoya Horiguchi , John Stultz , ross.zwisler@linux.intel.com, jmarchan@redhat.com, Johannes Weiner , Kees Cook , Al Viro , Cyrill Gorcunov , Robin Humble , David Rientjes , eric.engestrom@imgtec.com, Janis Danisevskis , calvinowens@fb.com, Alexey Dobriyan , "Kirill A. Shutemov" , ldufour@linux.vnet.ibm.com, linux-doc@vger.kernel.org, "linux-kernel@vger.kernel.org" , Ben Zhang , Bryan Freed , Filipe Brandenburger , Mateusz Guzik , Mel Gorman , Mike Galbraith Subject: Re: utime accounting regression since 4.6 (was: Re: [PACTH v2 0/3] Implement /proc//totmaps) Message-ID: <20160930094905.GA22867@dhcp22.suse.cz> References: <20160818180104.GS30162@dhcp22.suse.cz> <1471988771.2751.29.camel@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <1471988771.2751.29.camel@redhat.com> User-Agent: Mutt/1.6.0 (2016-04-01) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [CC Mike and Mel as they have seen some accounting oddities when doing performance testing. They can share details but essentially the system time just gets too high] For your reference the email thread started http://lkml.kernel.org/r/20160823143330.GL23577@dhcp22.suse.cz I suspect this is mainly for short lived processes - like kernel compile $ /usr/bin/time -v make mm/mmap.o [...] User time (seconds): 0.45 System time (seconds): 0.82 Percent of CPU this job got: 111% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.14 $ rm mm/mmap.o $ /usr/bin/time -v make mm/mmap.o [...] User time (seconds): 0.47 System time (seconds): 1.55 Percent of CPU this job got: 107% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.88 This is quite unexpected for a cache hot compile. I would expect most of the time being spent in userspace. $ perf report | grep kernel.vmlinux 2.01% as [kernel.vmlinux] [k] page_fault 0.59% cc1 [kernel.vmlinux] [k] page_fault 0.15% git [kernel.vmlinux] [k] page_fault 0.12% bash [kernel.vmlinux] [k] page_fault 0.11% sh [kernel.vmlinux] [k] page_fault 0.08% gcc [kernel.vmlinux] [k] page_fault 0.06% make [kernel.vmlinux] [k] page_fault 0.04% rm [kernel.vmlinux] [k] page_fault 0.03% ld [kernel.vmlinux] [k] page_fault 0.02% bash [kernel.vmlinux] [k] entry_SYSCALL_64 0.01% git [kernel.vmlinux] [k] entry_SYSCALL_64 0.01% cat [kernel.vmlinux] [k] page_fault 0.01% collect2 [kernel.vmlinux] [k] page_fault 0.00% sh [kernel.vmlinux] [k] entry_SYSCALL_64 0.00% rm [kernel.vmlinux] [k] entry_SYSCALL_64 0.00% grep [kernel.vmlinux] [k] page_fault doesn't show anything unexpected. Original Rik's reply follows: On Tue 23-08-16 17:46:11, Rik van Riel wrote: > On Tue, 2016-08-23 at 16:33 +0200, Michal Hocko wrote: [...] > > OK, so it seems I found it. I was quite lucky because > > account_user_time > > is not all that popular function and there were basically no changes > > besides Riks ff9a9b4c4334 ("sched, time: Switch > > VIRT_CPU_ACCOUNTING_GEN > > to jiffy granularity") and that seems to cause the regression. > > Reverting > > the commit on top of the current mmotm seems to fix the issue for me. > > > > And just to give Rik more context. While debugging overhead of the > > /proc//smaps I am getting a misleading output from /usr/bin/time > > -v > > (source for ./max_mmap is [1]) > > > > root@test1:~# uname -r > > 4.5.0-rc6-bisect1-00025-gff9a9b4c4334 > > root@test1:~# ./max_map  > > pid:2990 maps:65515 > > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} > > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/2990/smaps > > rss:263368 pss:262203 > >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END > > {printf "rss:%d pss:%d\n", rss, pss} /proc/2990/smaps" > >         User time (seconds): 0.00 > >         System time (seconds): 0.45 > >         Percent of CPU this job got: 98% > > > > > root@test1:~# /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} > > END {printf "rss:%d pss:%d\n", rss, pss}' /proc/3015/smaps > > rss:263316 pss:262199 > >         Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END > > {printf "rss:%d pss:%d\n", rss, pss} /proc/3015/smaps" > >         User time (seconds): 0.18 > >         System time (seconds): 0.29 > >         Percent of CPU this job got: 97% > > The patch in question makes user and system > time accounting essentially tick-based. If > jiffies changes while the task is in user > mode, time gets accounted as user time, if > jiffies changes while the task is in system > mode, time gets accounted as system time. > > If you get "unlucky", with a job like the > above, it is possible all time gets accounted > to system time. > > This would be true both with the system running > with a periodic timer tick (before and after my > patch is applied), and in nohz_idle mode (after > my patch). > > However, it does seem quite unlikely that you > get zero user time, since you have 125 timer > ticks in half a second. Furthermore, you do not > even have NO_HZ_FULL enabled... > > Does the workload consistently get zero user > time? > > If so, we need to dig further to see under > what precise circumstances that happens. > > On my laptop, with kernel 4.6.3-300.fc24.x86_64 > I get this: > > $ /usr/bin/time -v awk '/^Rss/{rss+=$2} /^Pss/{pss+=$2} END {printf > "rss:%d pss:%d\n", rss, pss}' /proc/19825/smaps > rss:263368 pss:262145 > Command being timed: "awk /^Rss/{rss+=$2} /^Pss/{pss+=$2} END > {printf "rss:%d pss:%d\n", rss, pss} /proc/19825/smaps" > User time (seconds): 0.64 > System time (seconds): 0.19 > Percent of CPU this job got: 99% > Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.83 > > The main difference between your and my > NO_HZ config seems to be that NO_HZ_FULL > is set here. However, it is not enabled > at run time, so both of our systems > should only really get NO_HZ_IDLE > effectively. > > Running tasks should get sampled with the > regular timer tick, while they are running. > > In other words, vtime accounting should be > disabled in both of our tests, for everything > except the idle task. > > Do I need to do anything special to reproduce > your bug, besides running the max mmap program > and the awk script? > > -- Michal Hocko SUSE Labs