linux-arm-kernel.lists.infradead.org archive mirror
* arm64 flushing 255GB of vmalloc space takes too long
       [not found] <CAMPhdO-j5SfHexP8hafB2EQVs91TOqp_k_SLwWmo9OHVEvNWiQ@mail.gmail.com>
@ 2014-07-09 17:40 ` Catalin Marinas
  2014-07-09 18:04   ` Eric Miao
  0 siblings, 1 reply; 8+ messages in thread
From: Catalin Marinas @ 2014-07-09 17:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 09, 2014 at 05:53:26PM +0100, Eric Miao wrote:
> On Tue, Jul 8, 2014 at 6:43 PM, Laura Abbott <lauraa@codeaurora.org> wrote:
> > I have an arm64 target which has been observed hanging in __purge_vmap_area_lazy
> > in vmalloc.c. The root cause of this 'hang' is that flush_tlb_kernel_range is
> > attempting to flush 255GB of virtual address space. This takes ~2 seconds and
> > preemption is disabled at this time thanks to the purge lock. Disabling
> > preemption for that time is long enough to trigger a watchdog we have set up.

That's definitely not good.

> > A couple of options I thought of:
> > 1) Increase the timeout of our watchdog to allow the flush to occur. Nobody
> > I suggested this to likes the idea as the watchdog firing generally catches
> > behavior that results in poor system performance and disabling preemption
> > for that long does seem like a problem.
> > 2) Change __purge_vmap_area_lazy to do less work under a spinlock. This would
> > certainly have a performance impact and I don't even know if it is plausible.
> > 3) Allow module unloading to trigger a vmalloc purge beforehand to help avoid
> > this case. This would still be racy if another vfree came in during the time
> > between the purge and the vfree but it might be good enough.
> > 4) Add 'if size > threshold flush entire tlb' (I haven't profiled this yet)
> 
> We have the same problem. I'd agree with point 2 and point 4, point 1/3 do not
> actually fix this issue. purge_vmap_area_lazy() could be called in other
> cases.

I would also discard point 2, as it still takes ~2 seconds, just not
under a spinlock.

> w.r.t the threshold to flush entire tlb instead of doing that page-by-page, that
> could be different from platform to platform. And considering the cost of tlb
> flush on x86, I wonder why this isn't an issue on x86.

The current __purge_vmap_area_lazy() was done as an optimisation (commit
db64fe02258f1) to avoid IPIs. So flush_tlb_kernel_range() would only be
IPI'ed once.

IIUC, the problem is how start/end are computed in
__purge_vmap_area_lazy(), so even if you have only two vmap areas, if
they are 255GB apart you've got this problem.

One temporary option is to limit the vmalloc space on arm64 to something
like 2 x RAM-size (haven't looked at this yet). But if you get a
platform with lots of RAM, you hit this problem again.

Which leaves us with point (4), but finding the threshold is indeed
platform-dependent. Another way could be a latency check: if the loop has
taken more than a certain number of microseconds, break out of it and
flush the whole TLB.
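
Something along these lines, purely as an illustration (completely
untested; the 500us budget, sampling the clock on every iteration and
using sched_clock() directly are all placeholders, and the loop is only
loosely based on the current flush_tlb_kernel_range()):

static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
	unsigned long addr;
	u64 deadline = sched_clock() + 500 * NSEC_PER_USEC;	/* arbitrary budget */

	start >>= 12;
	end >>= 12;

	dsb(ishst);
	for (addr = start; addr < end; addr++) {
		asm("tlbi vaae1is, %0" : : "r" (addr));
		/* taking too long: give up and nuke the whole TLB instead */
		if (sched_clock() > deadline) {
			flush_tlb_all();
			return;
		}
	}
	dsb(ish);
}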

-- 
Catalin


* arm64 flushing 255GB of vmalloc space takes too long
  2014-07-09 17:40 ` arm64 flushing 255GB of vmalloc space takes too long Catalin Marinas
@ 2014-07-09 18:04   ` Eric Miao
  2014-07-11  1:26     ` Laura Abbott
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Miao @ 2014-07-09 18:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 9, 2014 at 10:40 AM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Wed, Jul 09, 2014 at 05:53:26PM +0100, Eric Miao wrote:
>> On Tue, Jul 8, 2014 at 6:43 PM, Laura Abbott <lauraa@codeaurora.org> wrote:
>> > I have an arm64 target which has been observed hanging in __purge_vmap_area_lazy
>> > in vmalloc.c. The root cause of this 'hang' is that flush_tlb_kernel_range is
>> > attempting to flush 255GB of virtual address space. This takes ~2 seconds and
>> > preemption is disabled at this time thanks to the purge lock. Disabling
>> > preemption for that time is long enough to trigger a watchdog we have set up.
>
> That's definitely not good.
>
>> > A couple of options I thought of:
>> > 1) Increase the timeout of our watchdog to allow the flush to occur. Nobody
>> > I suggested this to likes the idea as the watchdog firing generally catches
>> > behavior that results in poor system performance and disabling preemption
>> > for that long does seem like a problem.
>> > 2) Change __purge_vmap_area_lazy to do less work under a spinlock. This would
>> > certainly have a performance impact and I don't even know if it is plausible.
>> > 3) Allow module unloading to trigger a vmalloc purge beforehand to help avoid
>> > this case. This would still be racy if another vfree came in during the time
>> > between the purge and the vfree but it might be good enough.
>> > 4) Add 'if size > threshold flush entire tlb' (I haven't profiled this yet)
>>
>> We have the same problem. I'd agree with point 2 and point 4, point 1/3 do not
>> actually fix this issue. purge_vmap_area_lazy() could be called in other
>> cases.
>
> I would also discard point 2, as it still takes ~2 seconds, just not
> under a spinlock.
>

The point is that we could still spend a good amount of time in that
function: given the default lazy_vfree_pages value of 32MB * log(ncpu), a
worst case where every vmap area is a single page, a page-by-page TLB
flush, and a traversal of the list calling __free_vmap_area() that many
times, the execution time is unlikely to come down to the microsecond
level.

If the work is inevitable, we should at least do it in a bit cleaner way.
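
For reference, the 32MB * log(ncpu) default comes from lazy_max_pages() in
mm/vmalloc.c, which looks roughly like this (quoting from memory, so please
check against the current tree):

static unsigned long lazy_max_pages(void)
{
	unsigned int log;

	log = fls(num_online_cpus());

	return log * (32UL * 1024 * 1024 / PAGE_SIZE);
}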

>> w.r.t the threshold to flush entire tlb instead of doing that page-by-page, that
>> could be different from platform to platform. And considering the cost of tlb
>> flush on x86, I wonder why this isn't an issue on x86.
>
> The current __purge_vmap_area_lazy() was done as an optimisation (commit
> db64fe02258f1) to avoid IPIs. So flush_tlb_kernel_range() would only be
> IPI'ed once.
>
> IIUC, the problem is how start/end are computed in
> __purge_vmap_area_lazy(), so even if you have only two vmap areas, if
> they are 255GB apart you've got this problem.

Indeed.

>
> One temporary option is to limit the vmalloc space on arm64 to something
> like 2 x RAM-size (haven't looked at this yet). But if you get a
> platform with lots of RAM, you hit this problem again.
>
> Which leaves us with point (4), but finding the threshold is indeed
> platform-dependent. Another way could be a latency check: if the loop has
> taken more than a certain number of microseconds, break out of it and
> flush the whole TLB.

Or we end up with platform-specific TLB flush implementations, just as we
did for cache ops. I would expect only a few platforms to need their own
thresholds. Would a simple heuristic guess at the threshold, based on the
number of TLB entries, be good enough?

>
> --
> Catalin


* arm64 flushing 255GB of vmalloc space takes too long
  2014-07-09 18:04   ` Eric Miao
@ 2014-07-11  1:26     ` Laura Abbott
  2014-07-11 12:45       ` Catalin Marinas
  0 siblings, 1 reply; 8+ messages in thread
From: Laura Abbott @ 2014-07-11  1:26 UTC (permalink / raw)
  To: linux-arm-kernel

On 7/9/2014 11:04 AM, Eric Miao wrote:
> On Wed, Jul 9, 2014 at 10:40 AM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
>> On Wed, Jul 09, 2014 at 05:53:26PM +0100, Eric Miao wrote:
>>> On Tue, Jul 8, 2014 at 6:43 PM, Laura Abbott <lauraa@codeaurora.org> wrote:
>>>> I have an arm64 target which has been observed hanging in __purge_vmap_area_lazy
>>>> in vmalloc.c. The root cause of this 'hang' is that flush_tlb_kernel_range is
>>>> attempting to flush 255GB of virtual address space. This takes ~2 seconds and
>>>> preemption is disabled at this time thanks to the purge lock. Disabling
>>>> preemption for that time is long enough to trigger a watchdog we have set up.
>>
>> That's definitely not good.
>>
>>>> A couple of options I thought of:
>>>> 1) Increase the timeout of our watchdog to allow the flush to occur. Nobody
>>>> I suggested this to likes the idea as the watchdog firing generally catches
>>>> behavior that results in poor system performance and disabling preemption
>>>> for that long does seem like a problem.
>>>> 2) Change __purge_vmap_area_lazy to do less work under a spinlock. This would
>>>> certainly have a performance impact and I don't even know if it is plausible.
>>>> 3) Allow module unloading to trigger a vmalloc purge beforehand to help avoid
>>>> this case. This would still be racy if another vfree came in during the time
>>>> between the purge and the vfree but it might be good enough.
>>>> 4) Add 'if size > threshold flush entire tlb' (I haven't profiled this yet)
>>>
>>> We have the same problem. I'd agree with point 2 and point 4, point 1/3 do not
>>> actually fix this issue. purge_vmap_area_lazy() could be called in other
>>> cases.
>>
>> I would also discard point 2, as it still takes ~2 seconds, just not
>> under a spinlock.
>>
> 
> The point is that we could still spend a good amount of time in that
> function: given the default lazy_vfree_pages value of 32MB * log(ncpu), a
> worst case where every vmap area is a single page, a page-by-page TLB
> flush, and a traversal of the list calling __free_vmap_area() that many
> times, the execution time is unlikely to come down to the microsecond
> level.
>
> If the work is inevitable, we should at least do it in a bit cleaner way.
> 
>>> w.r.t the threshold to flush entire tlb instead of doing that page-by-page, that
>>> could be different from platform to platform. And considering the cost of tlb
>>> flush on x86, I wonder why this isn't an issue on x86.
>>
>> The current __purge_vmap_area_lazy() was done as an optimisation (commit
>> db64fe02258f1) to avoid IPIs. So flush_tlb_kernel_range() would only be
>> IPI'ed once.
>>
>> IIUC, the problem is how start/end are computed in
>> __purge_vmap_area_lazy(), so even if you have only two vmap areas, if
>> they are 255GB apart you've got this problem.
> 
> Indeed.
> 
>>
>> One temporary option is to limit the vmalloc space on arm64 to something
>> like 2 x RAM-size (haven't looked at this yet). But if you get a
>> platform with lots of RAM, you hit this problem again.
>>
>> Which leaves us with point (4), but finding the threshold is indeed
>> platform-dependent. Another way could be a latency check: if the loop has
>> taken more than a certain number of microseconds, break out of it and
>> flush the whole TLB.
> 
> Or we end up with platform-specific TLB flush implementations, just as we
> did for cache ops. I would expect only a few platforms to need their own
> thresholds. Would a simple heuristic guess at the threshold, based on the
> number of TLB entries, be good enough?
> 

Mark Salter actually proposed a fix to this back in May 

https://lkml.org/lkml/2014/5/2/311

I never saw any further comments on it though. It also matches what x86
does with their TLB flushing. It fixes the problem for me and the threshold
seems to be the best we can do unless we want to introduce options per
platform. It will need to be rebased to the latest tree though.

Thanks,
Laura

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation


* arm64 flushing 255GB of vmalloc space takes too long
  2014-07-11  1:26     ` Laura Abbott
@ 2014-07-11 12:45       ` Catalin Marinas
  2014-07-23 21:25         ` Mark Salter
  0 siblings, 1 reply; 8+ messages in thread
From: Catalin Marinas @ 2014-07-11 12:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jul 11, 2014 at 02:26:48AM +0100, Laura Abbott wrote:
> On 7/9/2014 11:04 AM, Eric Miao wrote:
> > On Wed, Jul 9, 2014 at 10:40 AM, Catalin Marinas
> > <catalin.marinas@arm.com> wrote:
> >> On Wed, Jul 09, 2014 at 05:53:26PM +0100, Eric Miao wrote:
> >>> On Tue, Jul 8, 2014 at 6:43 PM, Laura Abbott <lauraa@codeaurora.org> wrote:
> >>>> I have an arm64 target which has been observed hanging in __purge_vmap_area_lazy
> >>>> in vmalloc.c. The root cause of this 'hang' is that flush_tlb_kernel_range is
> >>>> attempting to flush 255GB of virtual address space. This takes ~2 seconds and
> >>>> preemption is disabled at this time thanks to the purge lock. Disabling
> >>>> preemption for that time is long enough to trigger a watchdog we have set up.
> >>
> >> That's definitely not good.
> >>
> >>>> A couple of options I thought of:
> >>>> 1) Increase the timeout of our watchdog to allow the flush to occur. Nobody
> >>>> I suggested this to likes the idea as the watchdog firing generally catches
> >>>> behavior that results in poor system performance and disabling preemption
> >>>> for that long does seem like a problem.
> >>>> 2) Change __purge_vmap_area_lazy to do less work under a spinlock. This would
> >>>> certainly have a performance impact and I don't even know if it is plausible.
> >>>> 3) Allow module unloading to trigger a vmalloc purge beforehand to help avoid
> >>>> this case. This would still be racy if another vfree came in during the time
> >>>> between the purge and the vfree but it might be good enough.
> >>>> 4) Add 'if size > threshold flush entire tlb' (I haven't profiled this yet)
> >>>
> >>> We have the same problem. I'd agree with point 2 and point 4, point 1/3 do not
> >>> actually fix this issue. purge_vmap_area_lazy() could be called in other
> >>> cases.
> >>
> >> I would also discard point 2, as it still takes ~2 seconds, just not
> >> under a spinlock.
> > 
> > The point is that we could still spend a good amount of time in that
> > function: given the default lazy_vfree_pages value of 32MB * log(ncpu), a
> > worst case where every vmap area is a single page, a page-by-page TLB
> > flush, and a traversal of the list calling __free_vmap_area() that many
> > times, the execution time is unlikely to come down to the microsecond
> > level.
> >
> > If the work is inevitable, we should at least do it in a bit cleaner way.

In general I think it makes sense to use a mutex instead of a spinlock
here if the slowdown is caused by other things as well. That's independent
of the TLB invalidation optimisation for arm64.

> > Or we end up with platform-specific TLB flush implementations, just as we
> > did for cache ops. I would expect only a few platforms to need their own
> > thresholds. Would a simple heuristic guess at the threshold, based on the
> > number of TLB entries, be good enough?
> 
> Mark Salter actually proposed a fix to this back in May 
> 
> https://lkml.org/lkml/2014/5/2/311
> 
> I never saw any further comments on it though. It also matches what x86
> does with their TLB flushing. It fixes the problem for me and the threshold
> seems to be the best we can do unless we want to introduce options per
> platform. It will need to be rebased to the latest tree though.

There were other patches in this area and I forgot about this. The
problem is that the ARM architecture does not define the actual
micro-architectural implementation of the TLBs (and it shouldn't), so
there is no way to guess how many TLB entries there are. It's not an
easy figure to get either since there are multiple levels of caching for
the TLBs.

So we either guess some value here (we may not always be optimal) or we
put some time bound (e.g. based on sched_clock()) on how long to loop.
The latter is not optimal either, the only aim being to avoid
soft-lockups.

-- 
Catalin


* arm64 flushing 255GB of vmalloc space takes too long
  2014-07-11 12:45       ` Catalin Marinas
@ 2014-07-23 21:25         ` Mark Salter
  2014-07-24 14:24           ` Catalin Marinas
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Salter @ 2014-07-23 21:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2014-07-11 at 13:45 +0100, Catalin Marinas wrote:
> On Fri, Jul 11, 2014 at 02:26:48AM +0100, Laura Abbott wrote:
> > On 7/9/2014 11:04 AM, Eric Miao wrote:
> > > On Wed, Jul 9, 2014 at 10:40 AM, Catalin Marinas
> > > <catalin.marinas@arm.com> wrote:
> > >> On Wed, Jul 09, 2014 at 05:53:26PM +0100, Eric Miao wrote:
> > >>> On Tue, Jul 8, 2014 at 6:43 PM, Laura Abbott <lauraa@codeaurora.org> wrote:
> > >>>> I have an arm64 target which has been observed hanging in __purge_vmap_area_lazy
> > >>>> in vmalloc.c. The root cause of this 'hang' is that flush_tlb_kernel_range is
> > >>>> attempting to flush 255GB of virtual address space. This takes ~2 seconds and
> > >>>> preemption is disabled at this time thanks to the purge lock. Disabling
> > >>>> preemption for that time is long enough to trigger a watchdog we have set up.
> > >>
> > >> That's definitely not good.
> > >>
> > >>>> A couple of options I thought of:
> > >>>> 1) Increase the timeout of our watchdog to allow the flush to occur. Nobody
> > >>>> I suggested this to likes the idea as the watchdog firing generally catches
> > >>>> behavior that results in poor system performance and disabling preemption
> > >>>> for that long does seem like a problem.
> > >>>> 2) Change __purge_vmap_area_lazy to do less work under a spinlock. This would
> > >>>> certainly have a performance impact and I don't even know if it is plausible.
> > >>>> 3) Allow module unloading to trigger a vmalloc purge beforehand to help avoid
> > >>>> this case. This would still be racy if another vfree came in during the time
> > >>>> between the purge and the vfree but it might be good enough.
> > >>>> 4) Add 'if size > threshold flush entire tlb' (I haven't profiled this yet)
> > >>>
> > >>> We have the same problem. I'd agree with point 2 and point 4, point 1/3 do not
> > >>> actually fix this issue. purge_vmap_area_lazy() could be called in other
> > >>> cases.
> > >>
> > >> I would also discard point 2, as it still takes ~2 seconds, just not
> > >> under a spinlock.
> > > 
> > > The point is that we could still spend a good amount of time in that
> > > function: given the default lazy_vfree_pages value of 32MB * log(ncpu), a
> > > worst case where every vmap area is a single page, a page-by-page TLB
> > > flush, and a traversal of the list calling __free_vmap_area() that many
> > > times, the execution time is unlikely to come down to the microsecond
> > > level.
> > >
> > > If the work is inevitable, we should at least do it in a bit cleaner way.
> 
> In general I think it makes sense to use a mutex instead of a spinlock
> here if the slowdown is caused by other things as well. That's independent
> of the TLB invalidation optimisation for arm64.
> 
> > > Or we end up with platform-specific TLB flush implementations, just as we
> > > did for cache ops. I would expect only a few platforms to need their own
> > > thresholds. Would a simple heuristic guess at the threshold, based on the
> > > number of TLB entries, be good enough?
> > 
> > Mark Salter actually proposed a fix to this back in May 
> > 
> > https://lkml.org/lkml/2014/5/2/311
> > 
> > I never saw any further comments on it though. It also matches what x86
> > does with their TLB flushing. It fixes the problem for me and the threshold
> > seems to be the best we can do unless we want to introduce options per
> > platform. It will need to be rebased to the latest tree though.
> 
> There were other patches in this area and I forgot about this. The
> problem is that the ARM architecture does not define the actual
> micro-architectural implementation of the TLBs (and it shouldn't), so
> there is no way to guess how many TLB entries there are. It's not an
> easy figure to get either since there are multiple levels of caching for
> the TLBs.
> 
> So we either guess some value here (we may not always be optimal) or we
> put some time bound (e.g. based on sched_clock()) on how long to loop.
> The latter is not optimal either, the only aim being to avoid
> soft-lockups.
> 

Sorry for the late reply...

So, what would you like to see wrt this, Catalin? A reworked patch based
on time? IMO, something based on loop count or time seems better than
the status quo of a CPU potentially wasting 10s of seconds flushing the
tlb.


* arm64 flushing 255GB of vmalloc space takes too long
  2014-07-23 21:25         ` Mark Salter
@ 2014-07-24 14:24           ` Catalin Marinas
  2014-07-24 14:56             ` [PATCH] arm64: fix soft lockup due to large tlb flush range Mark Salter
  0 siblings, 1 reply; 8+ messages in thread
From: Catalin Marinas @ 2014-07-24 14:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Jul 23, 2014 at 10:25:34PM +0100, Mark Salter wrote:
> On Fri, 2014-07-11 at 13:45 +0100, Catalin Marinas wrote:
> > On Fri, Jul 11, 2014 at 02:26:48AM +0100, Laura Abbott wrote:
> > > Mark Salter actually proposed a fix to this back in May 
> > > 
> > > https://lkml.org/lkml/2014/5/2/311
> > > 
> > > I never saw any further comments on it though. It also matches what x86
> > > does with their TLB flushing. It fixes the problem for me and the threshold
> > > seems to be the best we can do unless we want to introduce options per
> > > platform. It will need to be rebased to the latest tree though.
> > 
> > There were other patches in this area and I forgot about this. The
> > problem is that the ARM architecture does not define the actual
> > micro-architectural implementation of the TLBs (and it shouldn't), so
> > there is no way to guess how many TLB entries there are. It's not an
> > easy figure to get either since there are multiple levels of caching for
> > the TLBs.
> > 
> > So we either guess some value here (we may not always be optimal) or we
> > put some time bound (e.g. based on sched_clock()) on how long to loop.
> > The latter is not optimal either, the only aim being to avoid
> > soft-lockups.
> 
> Sorry for the late reply...
> 
> So, what would you like to see wrt this, Catalin? A reworked patch based
> on time? IMO, something based on loop count or time seems better than
> the status quo of a CPU potentially wasting 10s of seconds flushing the
> tlb.

I think we could go with a loop-count limit for simplicity, but with a
larger number of iterations, only to avoid the lock-up (e.g. 1024, which
would be a 4MB range). My concern is that, for a few global mappings that
may or may not be in the TLB, we nuke both the L1 and L2 TLBs (the latter
can have over 1K entries). As for optimisation, I think we should look at
the original code generating such big ranges.

Would you mind posting a patch against the latest kernel?

-- 
Catalin


* [PATCH] arm64: fix soft lockup due to large tlb flush range
  2014-07-24 14:24           ` Catalin Marinas
@ 2014-07-24 14:56             ` Mark Salter
  2014-07-24 17:47               ` Catalin Marinas
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Salter @ 2014-07-24 14:56 UTC (permalink / raw)
  To: linux-arm-kernel

Under certain loads, this soft lockup has been observed:

   BUG: soft lockup - CPU#2 stuck for 22s! [ip6tables:1016]
   Modules linked in: ip6t_rpfilter ip6t_REJECT cfg80211 rfkill xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw vfat fat efivarfs xfs libcrc32c

   CPU: 2 PID: 1016 Comm: ip6tables Not tainted 3.13.0-0.rc7.30.sa2.aarch64 #1
   task: fffffe03e81d1400 ti: fffffe03f01f8000 task.ti: fffffe03f01f8000
   PC is at __cpu_flush_kern_tlb_range+0xc/0x40
   LR is at __purge_vmap_area_lazy+0x28c/0x3ac
   pc : [<fffffe000009c5cc>] lr : [<fffffe0000182710>] pstate: 80000145
   sp : fffffe03f01fbb70
   x29: fffffe03f01fbb70 x28: fffffe03f01f8000
   x27: fffffe0000b19000 x26: 00000000000000d0
   x25: 000000000000001c x24: fffffe03f01fbc50
   x23: fffffe03f01fbc58 x22: fffffe03f01fbc10
   x21: fffffe0000b2a3f8 x20: 0000000000000802
   x19: fffffe0000b2a3c8 x18: 000003fffdf52710
   x17: 000003ff9d8bb910 x16: fffffe000050fbfc
   x15: 0000000000005735 x14: 000003ff9d7e1a5c
   x13: 0000000000000000 x12: 000003ff9d7e1a5c
   x11: 0000000000000007 x10: fffffe0000c09af0
   x9 : fffffe0000ad1000 x8 : 000000000000005c
   x7 : fffffe03e8624000 x6 : 0000000000000000
   x5 : 0000000000000000 x4 : 0000000000000000
   x3 : fffffe0000c09cc8 x2 : 0000000000000000
   x1 : 000fffffdfffca80 x0 : 000fffffcd742150

The __cpu_flush_kern_tlb_range() function looks like:

  ENTRY(__cpu_flush_kern_tlb_range)
	dsb	sy
	lsr	x0, x0, #12
	lsr	x1, x1, #12
  1:	tlbi	vaae1is, x0
	add	x0, x0, #1
	cmp	x0, x1
	b.lo	1b
	dsb	sy
	isb
	ret
  ENDPROC(__cpu_flush_kern_tlb_range)

The above soft lockup shows the PC at tlbi insn with:

  x0 = 0x000fffffcd742150
  x1 = 0x000fffffdfffca80

So __cpu_flush_kern_tlb_range has 0x128ba930 tlbi flushes left
after it has already been looping for 23 seconds!

Looking up one frame at __purge_vmap_area_lazy(), there is:

	...
	list_for_each_entry_rcu(va, &vmap_area_list, list) {
		if (va->flags & VM_LAZY_FREE) {
			if (va->va_start < *start)
				*start = va->va_start;
			if (va->va_end > *end)
				*end = va->va_end;
			nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
			list_add_tail(&va->purge_list, &valist);
			va->flags |= VM_LAZY_FREEING;
			va->flags &= ~VM_LAZY_FREE;
		}
	}
	...
	if (nr || force_flush)
		flush_tlb_kernel_range(*start, *end);

So if two areas are being freed, the range passed to
flush_tlb_kernel_range() may be as large as the vmalloc
space. For arm64, this is ~240GB for a 4k page size and ~2TB
for a 64k page size.

This patch works around this problem by adding a loop limit.
If the range is larger than the limit, use flush_tlb_all()
rather than flushing based on individual pages. The limit
chosen is arbitrary and would be better if based on the
actual size of the tlb. I looked through the ARM ARM but
didn't see any easy way to get the actual tlb size, so for
now the arbitrary limit is better than the soft lockup.

Signed-off-by: Mark Salter <msalter@redhat.com>
---
 arch/arm64/include/asm/tlbflush.h | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index b9349c4..af3e572 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -98,8 +98,8 @@ static inline void flush_tlb_page(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
-static inline void flush_tlb_range(struct vm_area_struct *vma,
-					unsigned long start, unsigned long end)
+static inline void __flush_tlb_range(struct vm_area_struct *vma,
+				     unsigned long start, unsigned long end)
 {
 	unsigned long asid = (unsigned long)ASID(vma->vm_mm) << 48;
 	unsigned long addr;
@@ -112,7 +112,9 @@ static inline void flush_tlb_range(struct vm_area_struct *vma,
 	dsb(ish);
 }
 
-static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+#define MAX_TLB_LOOP 1024
+
+static inline void __flush_tlb_kernel_range(unsigned long start, unsigned long end)
 {
 	unsigned long addr;
 	start >>= 12;
@@ -124,6 +126,23 @@ static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end
 	dsb(ish);
 }
 
+static inline void flush_tlb_range(struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end)
+{
+	if (((end - start) >> PAGE_SHIFT) < MAX_TLB_LOOP)
+		__flush_tlb_range(vma, start, end);
+	else
+		flush_tlb_mm(vma->vm_mm);
+}
+
+static inline void flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+	if (((end - start) >> PAGE_SHIFT) < MAX_TLB_LOOP)
+		__flush_tlb_kernel_range(start, end);
+	else
+		flush_tlb_all();
+}
+
 /*
  * On AArch64, the cache coherency is handled via the set_pte_at() function.
  */
-- 
1.9.0


* [PATCH] arm64: fix soft lockup due to large tlb flush range
  2014-07-24 14:56             ` [PATCH] arm64: fix soft lockup due to large tlb flush range Mark Salter
@ 2014-07-24 17:47               ` Catalin Marinas
  0 siblings, 0 replies; 8+ messages in thread
From: Catalin Marinas @ 2014-07-24 17:47 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Jul 24, 2014 at 03:56:15PM +0100, Mark Salter wrote:
> Under certain loads, this soft lockup has been observed:
> 
>    BUG: soft lockup - CPU#2 stuck for 22s! [ip6tables:1016]
>    Modules linked in: ip6t_rpfilter ip6t_REJECT cfg80211 rfkill xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw vfat fat efivarfs xfs libcrc32c

Merged (with minor tweaks, comment added). Thanks.

-- 
Catalin

