linux-mm.kvack.org archive mirror
* Making per-cpu lists draining dependant on a flag
@ 2015-10-09 11:00 Nikolay Borisov
  2015-10-13 14:43 ` Michal Hocko
  2015-10-14  8:37 ` Michal Hocko
  0 siblings, 2 replies; 5+ messages in thread
From: Nikolay Borisov @ 2015-10-09 11:00 UTC (permalink / raw)
  To: linux-mm
  Cc: mgorman, Andrew Morton, Michal Hocko, Marian Marinov,
	SiteGround Operations

Hello mm people,


I want to ask you the following question which stemmed from analysing
and chasing this particular deadlock:
http://permalink.gmane.org/gmane.linux.kernel/2056730

To summarise it:

For simplicity I will use the following nomenclature:
t1 - kworker/u96:0
t2 - kworker/u98:39
t3 - kworker/u98:7

t1 issues drain_all_pages, which generates IPIs. At the same time,
however, t2 has already started an async write of pages as part of its
normal operation but is blocked waiting for t1 to complete its IPI
(generated from drain_all_pages), since they both work on the same
dm-thin volume. Meanwhile, t3 is executing ext4_finish_bio, which
disables interrupts, yet it depends on t2 completing its writes. But
since t3 has disabled interrupts, it won't respond to t1's IPI, and at
this point a hard lockup occurs. This happens because drain_all_pages
calls on_each_cpu_mask with the last argument equal to "true", meaning
"wait until the IPI handler has finished", which of course will never
happen in the described situation.

Based on that, I was wondering whether avoiding such a situation might
merit making the drain_all_pages invocation from
__alloc_pages_direct_reclaim dependent on a particular GFP flag being
passed, e.g. GFP_NOPCPDRAIN or something along those lines?

Alternatively, would it be possible to make the IPI asynchronous, i.e.
calling on_each_cpu_mask with the last argument equal to false?

Regards,
Nikolay

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org


* Re: Making per-cpu lists draining dependant on a flag
  2015-10-09 11:00 Making per-cpu lists draining dependant on a flag Nikolay Borisov
@ 2015-10-13 14:43 ` Michal Hocko
  2015-10-13 14:55   ` Nikolay Borisov
  2015-10-14  8:37 ` Michal Hocko
  1 sibling, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2015-10-13 14:43 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-mm, mgorman, Andrew Morton, Marian Marinov, SiteGround Operations

On Fri 09-10-15 14:00:31, Nikolay Borisov wrote:
> Hello mm people,
> 
> 
> I want to ask you the following question which stemmed from analysing
> and chasing this particular deadlock:
> http://permalink.gmane.org/gmane.linux.kernel/2056730
> 
> To summarise it:
> 
> For simplicity I will use the following nomenclature:
> t1 - kworker/u96:0
> t2 - kworker/u98:39
> t3 - kworker/u98:7

Could you be more specific about the trace of all three parties?
I am not sure I am completely following your description. Thanks!
-- 
Michal Hocko
SUSE Labs



* Re: Making per-cpu lists draining dependant on a flag
  2015-10-13 14:43 ` Michal Hocko
@ 2015-10-13 14:55   ` Nikolay Borisov
  0 siblings, 0 replies; 5+ messages in thread
From: Nikolay Borisov @ 2015-10-13 14:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, mgorman, Andrew Morton, Marian Marinov, SiteGround Operations



On 10/13/2015 05:43 PM, Michal Hocko wrote:
> On Fri 09-10-15 14:00:31, Nikolay Borisov wrote:
>> Hello mm people,
>>
>>
>> I want to ask you the following question which stemmed from analysing
>> and chasing this particular deadlock:
>> http://permalink.gmane.org/gmane.linux.kernel/2056730
>>
>> To summarise it:
>>
>> For simplicity I will use the following nomenclature:
>> t1 - kworker/u96:0
>> t2 - kworker/u98:39
>> t3 - kworker/u98:7
> 
> Could you be more specific about the trace of all three parties?
> I am not sure I am completely following your description. Thanks!

Hi,

Maybe you'd want to check this thread:
http://thread.gmane.org/gmane.linux.kernel/2056996

Essentially, in the beginning I thought the problem could be in the
memory manager, but after discussing it with Jan Kara I no longer think
the problem lies in the memory manager.


> 



* Re: Making per-cpu lists draining dependant on a flag
  2015-10-09 11:00 Making per-cpu lists draining dependant on a flag Nikolay Borisov
  2015-10-13 14:43 ` Michal Hocko
@ 2015-10-14  8:37 ` Michal Hocko
  2015-10-14  9:06   ` Nikolay Borisov
  1 sibling, 1 reply; 5+ messages in thread
From: Michal Hocko @ 2015-10-14  8:37 UTC (permalink / raw)
  To: Nikolay Borisov
  Cc: linux-mm, mgorman, Andrew Morton, Marian Marinov,
	SiteGround Operations, Jan Kara

On Fri 09-10-15 14:00:31, Nikolay Borisov wrote:
> Hello mm people,
> 
> 
> I want to ask you the following question which stemmed from analysing
> and chasing this particular deadlock:
> http://permalink.gmane.org/gmane.linux.kernel/2056730

This link doesn't seem to work properly for me. Could you post a
http://lkml.kernel.org/r/$msg_id link please?

> To summarise it:
> 
> For simplicity I will use the following nomenclature:
> t1 - kworker/u96:0
> t2 - kworker/u98:39
> t3 - kworker/u98:7
> 
> t1 issues drain_all_pages, which generates IPIs. At the same time,
> however,

OK, as per
http://lkml.kernel.org/r/1444318308-27560-1-git-send-email-kernel%40kyup.com
drain_all_pages is called from the __alloc_pages_nodemask called from
slab allocator. There is no stack leading to the allocation but then you
are saying

> t2 has already started an async write of pages
> as part of its normal operation but is blocked waiting for t1 to
> complete its IPI (generated from drain_all_pages), since they both
> work on the same dm-thin volume.

which I read as the allocator is holding the same dm_bufio_lock, right?

> Meanwhile, t3 is executing
> ext4_finish_bio, which disables interrupts, yet it depends on t2
> completing its writes.

That would be a bug on its own, because ext4_finish_bio seems to be
called from SoftIRQ context, so it cannot wait for a regular scheduling
context. Whoever is holding the BH_Uptodate_Lock has to be in
(soft)IRQ context.

<found the original thread on linux-mm finally - the threading got
broken on the way>
http://lkml.kernel.org/r/20151013131453.GA1332%40quack.suse.cz

So Jack (CCed) thinks this is a non-atomic update of flags and that
indeed sounds plausible.

> But since t3 has disabled interrupts, it won't
> respond to t1's IPI, and at this point a hard lockup occurs. This
> happens because drain_all_pages calls on_each_cpu_mask with the last
> argument equal to "true", meaning "wait until the IPI handler has
> finished", which of course will never happen in the described situation.
> 
> Based on that, I was wondering whether avoiding such a situation might
> merit making the drain_all_pages invocation from
> __alloc_pages_direct_reclaim dependent on a particular GFP flag being
> passed, e.g. GFP_NOPCPDRAIN or something along those lines?

I do not think so. Even if the dependency was real it would be a clear
deadlock even without drain_all_pages AFAICS.

> Alternatively, would it be possible to make the IPI asynchronous, i.e.
> calling on_each_cpu_mask with the last argument equal to false?

Strictly speaking, the allocation path doesn't really depend on the sync
behavior. We are just trying to release pages on pcp lists and retry the
allocation. Even if the allocation context was faster than the other CPUs
and failed the request, we would try again without triggering the OOM
killer, because the reclaim has apparently made some progress.

Other callers might be more sensitive. Anyway this is called only if the
allocator issues a sleeping allocation request so I think that waiting
here is perfectly acceptable.
-- 
Michal Hocko
SUSE Labs



* Re: Making per-cpu lists draining dependant on a flag
  2015-10-14  8:37 ` Michal Hocko
@ 2015-10-14  9:06   ` Nikolay Borisov
  0 siblings, 0 replies; 5+ messages in thread
From: Nikolay Borisov @ 2015-10-14  9:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, mgorman, Andrew Morton, Marian Marinov,
	SiteGround Operations, Jan Kara



On 10/14/2015 11:37 AM, Michal Hocko wrote:
> On Fri 09-10-15 14:00:31, Nikolay Borisov wrote:
>> Hello mm people,
>>
>>
>> I want to ask you the following question which stemmed from analysing
>> and chasing this particular deadlock:
>> http://permalink.gmane.org/gmane.linux.kernel/2056730
> 
> This link doesn't seem to work properly for me. Could you post a
> http://lkml.kernel.org/r/$msg_id link please?
> 
>> To summarise it:
>>
>> For simplicity I will use the following nomenclature:
>> t1 - kworker/u96:0
>> t2 - kworker/u98:39
>> t3 - kworker/u98:7
>>
>> t1 issues drain_all_pages, which generates IPIs. At the same time,
>> however,
> 
> OK, as per
> http://lkml.kernel.org/r/1444318308-27560-1-git-send-email-kernel%40kyup.com
> drain_all_pages is called from the __alloc_pages_nodemask called from
> slab allocator. There is no stack leading to the allocation but then you
> are saying
> 
>> t2 has already started an async write of pages
>> as part of its normal operation but is blocked waiting for t1 to
>> complete its IPI (generated from drain_all_pages), since they both
>> work on the same dm-thin volume.
> 
> which I read as the allocator is holding the same dm_bufio_lock, right?
> 
>> Meanwhile, t3 is executing
>> ext4_finish_bio, which disables interrupts, yet it depends on t2
>> completing its writes.
> 
> That would be a bug on its own, because ext4_finish_bio seems to be
> called from SoftIRQ context, so it cannot wait for a regular scheduling
> context. Whoever is holding the BH_Uptodate_Lock has to be in
> (soft)IRQ context.
> 
> <found the original thread on linux-mm finally - the threading got
> broken on the way>
> http://lkml.kernel.org/r/20151013131453.GA1332%40quack.suse.cz
> 
> So Jack (CCed) thinks this is a non-atomic update of flags and that
> indeed sounds plausible.
> 
>> But since t3 has disabled interrupts, it won't
>> respond to t1's IPI, and at this point a hard lockup occurs. This
>> happens because drain_all_pages calls on_each_cpu_mask with the last
>> argument equal to "true", meaning "wait until the IPI handler has
>> finished", which of course will never happen in the described situation.
>>
>> Based on that, I was wondering whether avoiding such a situation might
>> merit making the drain_all_pages invocation from
>> __alloc_pages_direct_reclaim dependent on a particular GFP flag being
>> passed, e.g. GFP_NOPCPDRAIN or something along those lines?
> 
> I do not think so. Even if the dependency was real it would be a clear
> deadlock even without drain_all_pages AFAICS.
> 
>> Alternatively, would it be possible to make the IPI asynchronous, i.e.
>> calling on_each_cpu_mask with the last argument equal to false?
> 
> Strictly speaking, the allocation path doesn't really depend on the sync
> behavior. We are just trying to release pages on pcp lists and retry the
> allocation. Even if the allocation context was faster than the other CPUs
> and failed the request, we would try again without triggering the OOM
> killer, because the reclaim has apparently made some progress.
> 
> Other callers might be more sensitive. Anyway this is called only if the
> allocator issues a sleeping allocation request so I think that waiting
> here is perfectly acceptable.

Thanks for taking the time to look over the issue. Indeed, I guess I
have been misled as to who the real culprit is, though the call traces
seemed to make the issue apparent. But kernel land seems to be a lot
more subtle :)

In any case I will test with Jack's patch and hopefully report that
everything is okay.

Nikolay


