linux-kernel.vger.kernel.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: CGEL <cgel.zte@gmail.com>
Cc: bsingharora@gmail.com, akpm@linux-foundation.org,
	yang.yang29@zte.com.cn, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH] delayacct: track delays from ksm cow
Date: Mon, 21 Mar 2022 16:45:40 +0100	[thread overview]
Message-ID: <0414c610-7f56-2dd2-0d83-ac3a5194eb60@redhat.com> (raw)
In-Reply-To: <6236c600.1c69fb81.7cd4.a900@mx.google.com>

On 20.03.22 07:13, CGEL wrote:
> On Fri, Mar 18, 2022 at 09:24:44AM +0100, David Hildenbrand wrote:
>> On 18.03.22 02:41, CGEL wrote:
>>> On Thu, Mar 17, 2022 at 11:05:22AM +0100, David Hildenbrand wrote:
>>>> On 17.03.22 10:48, CGEL wrote:
>>>>> On Thu, Mar 17, 2022 at 09:17:13AM +0100, David Hildenbrand wrote:
>>>>>> On 17.03.22 03:03, CGEL wrote:
>>>>>>> On Wed, Mar 16, 2022 at 03:56:23PM +0100, David Hildenbrand wrote:
>>>>>>>> On 16.03.22 14:34, cgel.zte@gmail.com wrote:
>>>>>>>>> From: Yang Yang <yang.yang29@zte.com.cn>
>>>>>>>>>
>>>>>>>>> Delay accounting does not track the delay of ksm cow.  When tasks
>>>>>>>>> have many ksm pages, they may spend a significant amount of time
>>>>>>>>> waiting for ksm cow.
>>>>>>>>>
>>>>>>>>> To measure the impact of ksm cow on tasks, record the delay when ksm
>>>>>>>>> cow happens. This could help users decide whether to use ksm
>>>>>>>>> or not.
>>>>>>>>>
>>>>>>>>> Also update tools/accounting/getdelays.c:
>>>>>>>>>
>>>>>>>>>     / # ./getdelays -dl -p 231
>>>>>>>>>     print delayacct stats ON
>>>>>>>>>     listen forever
>>>>>>>>>     PID     231
>>>>>>>>>
>>>>>>>>>     CPU             count     real total  virtual total    delay total  delay average
>>>>>>>>>                      6247     1859000000     2154070021     1674255063          0.268ms
>>>>>>>>>     IO              count    delay total  delay average
>>>>>>>>>                         0              0              0ms
>>>>>>>>>     SWAP            count    delay total  delay average
>>>>>>>>>                         0              0              0ms
>>>>>>>>>     RECLAIM         count    delay total  delay average
>>>>>>>>>                         0              0              0ms
>>>>>>>>>     THRASHING       count    delay total  delay average
>>>>>>>>>                         0              0              0ms
>>>>>>>>>     KSM             count    delay total  delay average
>>>>>>>>>                      3635      271567604              0ms
>>>>>>>>>
>>>>>>>>
>>>>>>>> TBH I'm not sure how particularly helpful this is and if we want this.
>>>>>>>>
>>>>>>> Thanks for replying.
>>>>>>>
>>>>>>> Users may use ksm by calling madvise(, , MADV_MERGEABLE) when they want
>>>>>>> to save memory; it's a tradeoff of suffering delay on ksm cow. Users can
>>>>>>> find out how much memory ksm saved by reading
>>>>>>> /sys/kernel/mm/ksm/pages_sharing, but they don't know the cost of the
>>>>>>> ksm cow delay, and this is important for some delay-sensitive tasks. If
>>>>>>> users know both the saved memory and the ksm cow delay, they could
>>>>>>> better use madvise(, , MADV_MERGEABLE).
>>>>>>
>>>>>> But that happens after the effects, no?
>>>>>>
>>>>>> IOW a user already called madvise(, , MADV_MERGEABLE) and then gets the
>>>>>> results.
>>>>>>
>>>>> Imagine users are developing or porting their applications on an
>>>>> experiment machine; they could take those benchmarks as feedback to
>>>>> decide whether to use madvise(, , MADV_MERGEABLE) or to adjust its range.
>>>>
>>>> And why can't they run it with and without and observe performance using
>>>> existing metrics (or even application-specific metrics?)?
>>>>
>>>>
>>> I think the reason why we need this patch is the same reason we need
>>> the swap, reclaim, and thrashing getdelays information. When the system
>>> is complex, it's hard to tell precisely which kernel activity impacts
>>> the observed performance or application-specific metrics: preempt?
>>> cgroup throttle? swap? reclaim? IO?
>>>
>>> So if we can get precise data on a factor's impact, tuning that
>>> factor (for this patch, ksm) is more efficient.
>>>
>>
>> I'm not convinced that we want to make or write-fault handler more
>> complicated for such a corner case with an unclear, eventual use case.
> 
> IIRC, KSM is designed for VMs. But recently we found KSM works well for
> systems with many containers (saving about 10%~20% of total memory), and
> container technology is more popular today, so KSM may be used more.
> 
> To reduce the impact on the write-fault handler, we could write a new
> function, with ifdef CONFIG_KSM inside, to do this job?

Maybe we just want to catch the impact of the write-fault handler when
copying more generally?

> 
>> IIRC, whenever using KSM you're already agreeing to eventually pay a
>> performance price, and the price heavily depends on other factors in the
>> system. Simply looking at the number of write-faults might already give
>> an indication what changed with KSM being enabled.
>>
> As for "you're already agreeing to pay a performance price": I think this
> shortcoming of KSM is what puts off its wider use. It's not easy for a
> user/app to decide how to use madvise(, , MADV_MERGEABLE).

... and my point is that the metric you're introducing might absolutely
not be expressive for such users playing with MADV_MERGEABLE. IMHO
people will look at actual application performance to figure out what
"harm" will be done, no?

But I do see value in capturing how many COWs we have in general --
either via a counter or via a delay as proposed by you.

> 
> Is there an easier way to use KSM, enjoying the memory saving while
> minimizing the performance price for containers? We think it's possible,
> and are working on a new patch: provide a knob for a cgroup to
> enable/disable KSM for all tasks in that cgroup, so if your container is
> delay sensitive you just leave it off, and if not you can easily enable
> KSM without modifying app code.
> 
> Before using the new knob, users might want to know the precise impact of
> KSM. I think write-faults are an indirect measure; if an indirect measure
> were good enough, why would we need taskstats and PSI? By the way,
> getdelays supports container statistics.

Would anything speak against making this more generic and capturing the
delay for any COW, not just for KSM?

-- 
Thanks,

David / dhildenb



Thread overview: 13+ messages
2022-03-16 13:34 [PATCH] delayacct: track delays from ksm cow cgel.zte
2022-03-16 14:56 ` David Hildenbrand
2022-03-17  2:03   ` CGEL
2022-03-17  8:17     ` David Hildenbrand
2022-03-17  9:48       ` CGEL
2022-03-17 10:05         ` David Hildenbrand
2022-03-18  1:41           ` CGEL
2022-03-18  8:24             ` David Hildenbrand
2022-03-20  6:13               ` CGEL
2022-03-21 15:45                 ` David Hildenbrand [this message]
2022-03-22  3:12                   ` CGEL
2022-03-22  7:55                     ` David Hildenbrand
2022-03-22  9:09                       ` CGEL
