* [PATCH] blk-cgroup: Optimize blkcg_rstat_flush()
@ 2022-05-31 18:18 ` Waiman Long
  0 siblings, 1 reply; 5+ messages in thread
From: Waiman Long @ 2022-05-31 18:18 UTC
  To: Tejun Heo, Jens Axboe
  Cc: cgroups, linux-block, linux-kernel, Ming Lei, Waiman Long

For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
can be especially problematic as interrupts are disabled during the
flush. It was reported that it might take seconds in some extreme
cases, leading to hard lockup messages.

As it is likely that not all the percpu blkg_iostat_set's have been
updated since the last flush, the stale blkg_iostat_set's don't need
to be flushed again. This patch optimizes blkcg_rstat_flush() by
checking the current sequence number against the one recorded at the
last flush and skipping the blkg_iostat_set if the sequence number
hasn't changed. There is a slight chance that it may miss an update
that is being done in parallel; such an update will just have to wait
until the next flush.
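
For reference, the u64_stats sequence protocol that this check relies
on looks roughly like this on configurations where the seqcount is
active (simplified, illustrative sketch only; the real helpers live in
include/linux/u64_stats_sync.h, and rw/bytes are placeholders):

	/* writer: per-cpu stat update */
	u64_stats_update_begin(&bisc->sync);	/* seq++ (odd: in progress) */
	bisc->cur.bytes[rw] += bytes;		/* placeholder update */
	u64_stats_update_end(&bisc->sync);	/* seq++ (even: stable) */

	/* reader: flush side */
	seq = u64_stats_fetch_begin(&bisc->sync); /* even snapshot of seq */
	if (seq == bisc->last_seq)	/* no writer ran since last flush */
		continue;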

Signed-off-by: Waiman Long <longman@redhat.com>
---
 block/blk-cgroup.c | 18 +++++++++++++++---
 block/blk-cgroup.h |  1 +
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 40161a3f68d0..79b89af61ef2 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -864,11 +864,23 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		unsigned long flags;
 		unsigned int seq;
 
+		seq = u64_stats_fetch_begin(&bisc->sync);
+		/*
+		 * If the sequence number hasn't been updated since the last
+		 * flush, we can skip this blkg_iostat_set though we may miss
+		 * an update that is happening in parallel.
+		 */
+		if (seq == bisc->last_seq)
+			continue;
+
 		/* fetch the current per-cpu values */
-		do {
-			seq = u64_stats_fetch_begin(&bisc->sync);
+		while (true) {
+			bisc->last_seq = seq;
 			blkg_iostat_set(&cur, &bisc->cur);
-		} while (u64_stats_fetch_retry(&bisc->sync, seq));
+			if (!u64_stats_fetch_retry(&bisc->sync, seq))
+				break;
+			seq = u64_stats_fetch_begin(&bisc->sync);
+		}
 
 		/* propagate percpu delta to global */
 		flags = u64_stats_update_begin_irqsave(&blkg->iostat.sync);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index d4de0a35e066..22b4ea139b93 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -45,6 +45,7 @@ struct blkg_iostat_set {
 	struct u64_stats_sync		sync;
 	struct blkg_iostat		cur;
 	struct blkg_iostat		last;
+	unsigned int			last_seq;
 };
 
 /* association between a blk cgroup and a request queue */
-- 
2.31.1


* Re: [PATCH] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-05-31 18:18 ` Waiman Long
@ 2022-05-31 18:24 ` Tejun Heo
  2022-05-31 19:03   ` Waiman Long
  0 siblings, 1 reply; 5+ messages in thread
From: Tejun Heo @ 2022-05-31 18:24 UTC
  To: Waiman Long; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

Hello, Waiman.

On Tue, May 31, 2022 at 02:18:21PM -0400, Waiman Long wrote:
> For a system with many CPUs and block devices, the time to do
> blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
> can be especially problematic as interrupts are disabled during the
> flush. It was reported that it might take seconds in some extreme
> cases, leading to hard lockup messages.
> 
> As it is likely that not all the percpu blkg_iostat_set's have been
> updated since the last flush, the stale blkg_iostat_set's don't need
> to be flushed again. This patch optimizes blkcg_rstat_flush() by
> checking the current sequence number against the one recorded at the
> last flush and skipping the blkg_iostat_set if the sequence number
> hasn't changed. There is a slight chance that it may miss an update
> that is being done in parallel; such an update will just have to wait
> until the next flush.
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  block/blk-cgroup.c | 18 +++++++++++++++---
>  block/blk-cgroup.h |  1 +
>  2 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 40161a3f68d0..79b89af61ef2 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -864,11 +864,23 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>  		unsigned long flags;
>  		unsigned int seq;
>  
> +		seq = u64_stats_fetch_begin(&bisc->sync);
> +		/*
> +		 * If the sequence number hasn't been updated since the last
> +		 * flush, we can skip this blkg_iostat_set though we may miss
> +		 * an update that is happening in parallel.
> +		 */
> +		if (seq == bisc->last_seq)
> +			continue;

Is this a sufficient solution? The code assumes that there aren't too many
blkgs for the cgroup, which can be wrong in some cases. Wouldn't it be
better to create a list of updated blkg's per blkcg so that we don't walk
all the dormant ones?
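
Something along these lines, as a rough and untested sketch (field
names made up for illustration):

	struct blkcg {
		...
		/* blkgs with stat updates pending since the last flush */
		struct list_head	updated_blkgs;
		spinlock_t		updated_lock;
	};

The update path would add a blkg to updated_blkgs the first time it is
dirtied, and the flush path would splice the list out under the lock
and walk only those entries instead of every blkg hanging off the
blkcg.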

Thanks.

-- 
tejun


* Re: [PATCH] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-05-31 18:24 ` Tejun Heo
@ 2022-05-31 19:03   ` Waiman Long
  2022-05-31 19:22     ` Waiman Long
  0 siblings, 1 reply; 5+ messages in thread
From: Waiman Long @ 2022-05-31 19:03 UTC
  To: Tejun Heo; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On 5/31/22 14:24, Tejun Heo wrote:
> Hello, Waiman.
>
> On Tue, May 31, 2022 at 02:18:21PM -0400, Waiman Long wrote:
>> For a system with many CPUs and block devices, the time to do
>> blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
>> can be especially problematic as interrupts are disabled during the
>> flush. It was reported that it might take seconds in some extreme
>> cases, leading to hard lockup messages.
>>
>> As it is likely that not all the percpu blkg_iostat_set's have been
>> updated since the last flush, the stale blkg_iostat_set's don't need
>> to be flushed again. This patch optimizes blkcg_rstat_flush() by
>> checking the current sequence number against the one recorded at the
>> last flush and skipping the blkg_iostat_set if the sequence number
>> hasn't changed. There is a slight chance that it may miss an update
>> that is being done in parallel; such an update will just have to wait
>> until the next flush.
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   block/blk-cgroup.c | 18 +++++++++++++++---
>>   block/blk-cgroup.h |  1 +
>>   2 files changed, 16 insertions(+), 3 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index 40161a3f68d0..79b89af61ef2 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -864,11 +864,23 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>>   		unsigned long flags;
>>   		unsigned int seq;
>>   
>> +		seq = u64_stats_fetch_begin(&bisc->sync);
>> +		/*
>> +		 * If the sequence number hasn't been updated since the last
>> +		 * flush, we can skip this blkg_iostat_set though we may miss
>> +		 * an update that is happening in parallel.
>> +		 */
>> +		if (seq == bisc->last_seq)
>> +			continue;
> Is this a sufficient solution? The code assumes that there aren't too many
> blkgs for the cgroup, which can be wrong in some cases. Wouldn't it be
> better to create a list of updated blkg's per blkcg so that we don't walk
> all the dormant ones?

It is probably not a sufficient solution, but it is simple. The problem
with keeping a list of recently updated blkg's is that the sequence lock
does not provide enough synchronization on the read side to guarantee a
race-free reset of the list. It may be doable, but I need to think
harder about the best way to do it without too much overhead.
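
Concretely, the kind of interleaving I am worried about looks something
like this (hypothetical, assuming a per-entry "queued" marker):

	flusher				updater
	-------				-------
	take the entry off the list
	read and fold its per-cpu stats
					update the per-cpu stats
					entry still marked as queued,
					so it is not re-added
	clear the queued marker

	-> the new update sits in the per-cpu counters, but the entry is
	   on neither the list nor scheduled for the next flush, until
	   some later update re-queues it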

Thanks,
Longman



* Re: [PATCH] blk-cgroup: Optimize blkcg_rstat_flush()
  2022-05-31 19:03   ` Waiman Long
@ 2022-05-31 19:22     ` Waiman Long
  0 siblings, 0 replies; 5+ messages in thread
From: Waiman Long @ 2022-05-31 19:22 UTC
  To: Tejun Heo; +Cc: Jens Axboe, cgroups, linux-block, linux-kernel, Ming Lei

On 5/31/22 15:03, Waiman Long wrote:
> On 5/31/22 14:24, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Tue, May 31, 2022 at 02:18:21PM -0400, Waiman Long wrote:
>>> For a system with many CPUs and block devices, the time to do
>>> blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
>>> can be especially problematic as interrupts are disabled during the
>>> flush. It was reported that it might take seconds in some extreme
>>> cases, leading to hard lockup messages.
>>>
>>> As it is likely that not all the percpu blkg_iostat_set's have been
>>> updated since the last flush, the stale blkg_iostat_set's don't need
>>> to be flushed again. This patch optimizes blkcg_rstat_flush() by
>>> checking the current sequence number against the one recorded at the
>>> last flush and skipping the blkg_iostat_set if the sequence number
>>> hasn't changed. There is a slight chance that it may miss an update
>>> that is being done in parallel; such an update will just have to wait
>>> until the next flush.
>>>
>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>> ---
>>>   block/blk-cgroup.c | 18 +++++++++++++++---
>>>   block/blk-cgroup.h |  1 +
>>>   2 files changed, 16 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>>> index 40161a3f68d0..79b89af61ef2 100644
>>> --- a/block/blk-cgroup.c
>>> +++ b/block/blk-cgroup.c
>>> @@ -864,11 +864,23 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>>>  		unsigned long flags;
>>>  		unsigned int seq;
>>>
>>> +		seq = u64_stats_fetch_begin(&bisc->sync);
>>> +		/*
>>> +		 * If the sequence number hasn't been updated since the last
>>> +		 * flush, we can skip this blkg_iostat_set though we may miss
>>> +		 * an update that is happening in parallel.
>>> +		 */
>>> +		if (seq == bisc->last_seq)
>>> +			continue;
>> Is this a sufficient solution? The code assumes that there aren't too
>> many blkgs for the cgroup, which can be wrong in some cases. Wouldn't
>> it be better to create a list of updated blkg's per blkcg so that we
>> don't walk all the dormant ones?
>
> It is probably not a sufficient solution, but it is simple. The problem
> with keeping a list of recently updated blkg's is that the sequence lock
> does not provide enough synchronization on the read side to guarantee a
> race-free reset of the list. It may be doable, but I need to think
> harder about the best way to do it without too much overhead.
>
> Thanks,
> Longman
>
I think we can use a lockless list to keep track of the recently updated
blkg's for each blkcg. There is an atomic operation overhead when an
entry is added to the list, though. I will send an updated patch later
today or tomorrow, roughly along the lines of the sketch below.
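
A rough sketch of what I have in mind (tentative names, untested; the
actual patch may differ):

	/* in struct blkg_iostat_set */
	struct llist_node	lnode;	 /* entry in blkcg's lockless list */
	bool			lqueued; /* already on the list? */

	/* update side: queue the entry the first time it is dirtied */
	u64_stats_update_begin(&bis->sync);
	/* ... update bis->cur ... */
	u64_stats_update_end(&bis->sync);
	if (!READ_ONCE(bis->lqueued)) {
		WRITE_ONCE(bis->lqueued, true);
		llist_add(&bis->lnode, &blkcg->lhead);	/* atomic cmpxchg */
	}

	/* flush side: walk only the blkg's that were actually updated */
	struct llist_node *lnode = llist_del_all(&blkcg->lhead);
	struct blkg_iostat_set *bisc, *next;

	llist_for_each_entry_safe(bisc, next, lnode, lnode) {
		WRITE_ONCE(bisc->lqueued, false);
		/* fetch and propagate the per-cpu delta as before */
	}

As with the sequence number check, clearing lqueued can race with a
concurrent update; a missed update just waits for the next flush.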

Thanks,
Longman

