All of lore.kernel.org
 help / color / mirror / Atom feed
* Page fault while link_path_walk for path_len > 4060 bytes
@ 2017-08-07  6:22 ankijain
  2017-08-07 12:18 ` ankijain
  0 siblings, 1 reply; 11+ messages in thread
From: ankijain @ 2017-08-07  6:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: stummala, rharjani

Hi

We are facing a issue while creating the directories up-to maximum 
depth.
While doing this, page fault occurs some times if PATH_LEN is near to 
page boundary.
During hash_name() calculation inside link_path_walk(), 
load_unaligned_zeropad() is causing page fault as it tries to access 
next page data(which is not mapped).
We have already taken a lock(rcu_read_lock) in path_init() and as a part 
of page fault it checks if process can sleep (might_sleep), which 
results in panic as rcu_read_lock is already taken in path_init().

One observation : Panic happens only when page(which contain the 
path_name and length > 4060) is the last page of the slab memory.

Test Case detail:
Kernel version : 4.4
We are following stress-ng test.
https://github.com/ColinIanKing/stress-ng

1. dirdeep for creating directory upto maximum depth.
2. vm for memory allocation in backgroud.

Above 2 test cases are running simultaneously and it takes ~3-4 hours to 
reproduce this issue.

Pasting comment section for load_unaligned_zeropad():
/*
   * Load an unaligned word from kernel space.
   *
   * In the (very unlikely) case of the word being a page-crosser
   * and the next page not being mapped, take the exception and
   * return zeroes in the non-existing part.
   */

Wanted to confirm if this "very unlikely case" mentioned above is 
pointing to the same issue ?
Also let us know if this is a known issue and if some fix is already 
available. If not, please suggest how we can proceed further.

Appreciate your help. Thanks in advance!!

Regards,
Ankit Jain
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, 
Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Page fault while link_path_walk for path_len > 4060 bytes
  2017-08-07  6:22 Page fault while link_path_walk for path_len > 4060 bytes ankijain
@ 2017-08-07 12:18 ` ankijain
  2017-08-14 11:14   ` ankijain
  0 siblings, 1 reply; 11+ messages in thread
From: ankijain @ 2017-08-07 12:18 UTC (permalink / raw)
  To: linux-fsdevel

On 2017-08-07 11:52, ankijain@codeaurora.org wrote:
> Hi
> 
> We are facing a issue while creating the directories up-to maximum 
> depth.
> While doing this, page fault occurs some times if PATH_LEN is near to
> page boundary.
> During hash_name() calculation inside link_path_walk(),
> load_unaligned_zeropad() is causing page fault as it tries to access
> next page data(which is not mapped).
> We have already taken a lock(rcu_read_lock) in path_init() and as a
> part of page fault it checks if process can sleep (might_sleep), which
> results in panic as rcu_read_lock is already taken in path_init().
> 
> One observation : Panic happens only when page(which contain the
> path_name and length > 4060) is the last page of the slab memory.
> 
> Test Case detail:
> Kernel version : 4.4
> We are following stress-ng test.
> https://github.com/ColinIanKing/stress-ng
> 
> 1. dirdeep for creating directory upto maximum depth.
> 2. vm for memory allocation in backgroud.
> 
> Above 2 test cases are running simultaneously and it takes ~3-4 hours
> to reproduce this issue.
> 
> Pasting comment section for load_unaligned_zeropad():
> /*
>   * Load an unaligned word from kernel space.
>   *
>   * In the (very unlikely) case of the word being a page-crosser
>   * and the next page not being mapped, take the exception and
>   * return zeroes in the non-existing part.
>   */
> 
> Wanted to confirm if this "very unlikely case" mentioned above is
> pointing to the same issue ?
> Also let us know if this is a known issue and if some fix is already
> available. If not, please suggest how we can proceed further.
> 
> Appreciate your help. Thanks in advance!!
> 
> Regards,
> Ankit Jain
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation 
> Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
> Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Page fault while link_path_walk for path_len > 4060 bytes
  2017-08-07 12:18 ` ankijain
@ 2017-08-14 11:14   ` ankijain
       [not found]     ` <953068e79da559bfd4f13e46e31c5a4e@codeaurora.org>
  0 siblings, 1 reply; 11+ messages in thread
From: ankijain @ 2017-08-14 11:14 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: linux-fsdevel-owner

Hi linux-fsdevel

could you please look at this issue or suggest how we can proceed 
further.

Thanks,

On 2017-08-07 17:48, ankijain@codeaurora.org wrote:
> On 2017-08-07 11:52, ankijain@codeaurora.org wrote:
>> Hi
>> 
>> We are facing a issue while creating the directories up-to maximum 
>> depth.
>> While doing this, page fault occurs some times if PATH_LEN is near to
>> page boundary.
>> During hash_name() calculation inside link_path_walk(),
>> load_unaligned_zeropad() is causing page fault as it tries to access
>> next page data(which is not mapped).
>> We have already taken a lock(rcu_read_lock) in path_init() and as a
>> part of page fault it checks if process can sleep (might_sleep), which
>> results in panic as rcu_read_lock is already taken in path_init().
>> 
>> One observation : Panic happens only when page(which contain the
>> path_name and length > 4060) is the last page of the slab memory.
>> 
>> Test Case detail:
>> Kernel version : 4.4
>> We are following stress-ng test.
>> https://github.com/ColinIanKing/stress-ng
>> 
>> 1. dirdeep for creating directory upto maximum depth.
>> 2. vm for memory allocation in backgroud.
>> 
>> Above 2 test cases are running simultaneously and it takes ~3-4 hours
>> to reproduce this issue.
>> 
>> Pasting comment section for load_unaligned_zeropad():
>> /*
>>   * Load an unaligned word from kernel space.
>>   *
>>   * In the (very unlikely) case of the word being a page-crosser
>>   * and the next page not being mapped, take the exception and
>>   * return zeroes in the non-existing part.
>>   */
>> 
>> Wanted to confirm if this "very unlikely case" mentioned above is
>> pointing to the same issue ?
>> Also let us know if this is a known issue and if some fix is already
>> available. If not, please suggest how we can proceed further.
>> 
>> Appreciate your help. Thanks in advance!!
>> 
>> Regards,
>> Ankit Jain
>> Qualcomm India Private Limited, on behalf of Qualcomm Innovation 
>> Center, Inc.
>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
>> Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Page fault while link_path_walk for path_len > 4060 bytes
       [not found]     ` <953068e79da559bfd4f13e46e31c5a4e@codeaurora.org>
@ 2017-08-22 12:57       ` Will Deacon
       [not found]         ` <b13b3a27e92c5413d168ad775163ea91@codeaurora.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Will Deacon @ 2017-08-22 12:57 UTC (permalink / raw)
  To: linux-arm-kernel

Ankit,

On Tue, Aug 22, 2017 at 11:19:43AM +0530, ankijain at codeaurora.org wrote:
> Could you please look into this issue?
> Can we use srcu_read_lock/srcu_read_unlock instead of
> rcu_read_lock/rcu_read_unlock in path_init() ,terminate_Walk() and
> complete_walk() ?
> if we go for srcu_read_lock, then how can we maintain the idx for
> srcu_read_unlock in unlazy_walk().
> 
> Requesting you please comment on this issue.

It would help if you could:

  1. Provide the kernel panic log
  2. Describe the machine on which you're running
  3. Run the repro code on a mainline kernel

Assuming this is an arm64 machine, I'm intrigued as to what is generating
the fault. v4.4 doesn't support DEBUG_PAGEALLOC, so it must be something
else. Do you know what the ESR value is when you take the fault?

Does the diff below help?

Will

--->8

diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 2509e4fe6992..aa6c4a9e1113 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -611,7 +611,7 @@ static const struct fault_info fault_info[] = {
 	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 0 translation fault"	},
 	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 1 translation fault"	},
 	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 2 translation fault"	},
-	{ do_page_fault,	SIGSEGV, SEGV_MAPERR,	"level 3 translation fault"	},
+	{ do_translation_fault,	SIGSEGV, SEGV_MAPERR,	"level 3 translation fault"	},
 	{ do_bad,		SIGBUS,  0,		"unknown 8"			},
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 1 access flag fault"	},
 	{ do_page_fault,	SIGSEGV, SEGV_ACCERR,	"level 2 access flag fault"	},


> On 2017-08-14 16:44, ankijain at codeaurora.org wrote:
> >Hi linux-fsdevel
> >
> >could you please look at this issue or suggest how we can proceed further.
> >
> >Thanks,
> >
> >On 2017-08-07 17:48, ankijain at codeaurora.org wrote:
> >>On 2017-08-07 11:52, ankijain at codeaurora.org wrote:
> >>>Hi
> >>>
> >>>We are facing a issue while creating the directories up-to maximum
> >>>depth.
> >>>While doing this, page fault occurs some times if PATH_LEN is near to
> >>>page boundary.
> >>>During hash_name() calculation inside link_path_walk(),
> >>>load_unaligned_zeropad() is causing page fault as it tries to access
> >>>next page data(which is not mapped).
> >>>We have already taken a lock(rcu_read_lock) in path_init() and as a
> >>>part of page fault it checks if process can sleep (might_sleep), which
> >>>results in panic as rcu_read_lock is already taken in path_init().
> >>>
> >>>One observation : Panic happens only when page(which contain the
> >>>path_name and length > 4060) is the last page of the slab memory.
> >>>
> >>>Test Case detail:
> >>>Kernel version : 4.4
> >>>We are following stress-ng test.
> >>>https://github.com/ColinIanKing/stress-ng
> >>>
> >>>1. dirdeep for creating directory upto maximum depth.
> >>>2. vm for memory allocation in backgroud.
> >>>
> >>>Above 2 test cases are running simultaneously and it takes ~3-4 hours
> >>>to reproduce this issue.
> >>>
> >>>Pasting comment section for load_unaligned_zeropad():
> >>>/*
> >>>  * Load an unaligned word from kernel space.
> >>>  *
> >>>  * In the (very unlikely) case of the word being a page-crosser
> >>>  * and the next page not being mapped, take the exception and
> >>>  * return zeroes in the non-existing part.
> >>>  */
> >>>
> >>>Wanted to confirm if this "very unlikely case" mentioned above is
> >>>pointing to the same issue ?
> >>>Also let us know if this is a known issue and if some fix is already
> >>>available. If not, please suggest how we can proceed further.
> >>>
> >>>Appreciate your help. Thanks in advance!!
> >>>
> >>>Regards,
> >>>Ankit Jain
> >>>Qualcomm India Private Limited, on behalf of Qualcomm Innovation
> >>>Center, Inc.
> >>>Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
> >>>Linux Foundation Collaborative Project

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Page fault while link_path_walk for path_len > 4060 bytes
       [not found]         ` <b13b3a27e92c5413d168ad775163ea91@codeaurora.org>
@ 2017-08-28  6:20           ` Al Viro
       [not found]             ` <c4256a77288fd5f275a45bc73c9c78f0@codeaurora.org>
  2017-09-12 20:26           ` Will Deacon
  1 sibling, 1 reply; 11+ messages in thread
From: Al Viro @ 2017-08-28  6:20 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Aug 28, 2017 at 09:53:00AM +0530, ankijain at codeaurora.org wrote:
> Hi Will Deacon/ Al viro
> 
> 
> -->Please find the attached kmsg.txt
> <3>[17620.275249] BUG: sleeping function called from invalid context at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/arch/arm64/mm/fault.c:313
> <3>[17620.276504] in_atomic(): 0, irqs_disabled(): 0, pid: 10290, name:
> stress-ng-dirde
> <6>[17620.298995] ------------[ cut here ]------------
> <2>[17620.299009] kernel BUG at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528!
> <6>[17620.306372] ------------[ cut here ]------------
> <2>[17620.327239] kernel BUG at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528!
> 
> 
> --> we are using arm64 machine with kernel 4.4.
> --> can you please guide us, how to capture ESR value while taking the
> fault?
> -->
> -    { do_page_fault,    SIGSEGV, SEGV_MAPERR,    "level 3 translation
> fault"    },
> +    { do_translation_fault,    SIGSEGV, SEGV_MAPERR,    "level 3
> translation fault"    },
> we will try with above changes and get back to you.
> 
> -> config and kmsg are attached.
> 
> Regards,
> Ankit Jain
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation
> Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
> Linux Foundation Collaborative Project

Umm...  Line numbers make no sense for 4.4.  Could you post a reference
to the actual tree used (repository + SHA1; again, it can't be vanilla
4.4, or stable/linux-4.4.y, for that matter) as well as your .config?

In any case, looks like in_atomic() is false there, so we need an explicit
pagefault_disable() to make sure it goes to no_context.

Looking through the callchains...
	* __d_lookup() -> d_same_name() -> dentry_cmp() -> dentry_string_cmp()
with rcu_read_lock() held by __d_lookup().
	* d_alloc_parallel() -> d_same_name(), etc.  rcu_read_lock() held by
d_alloc_parallel() in one case, dentry->d_lock in another.
	* d_exact_alias() -> d_same_name().  inode->i_lock held by d_exact_alias().
	* d_alloc_parallel() -> __d_lookup_rcu() -> dentry_cmp().
rcu_read_lock() held by d_alloc_parallel().
	* lookup_fast() -> __d_lookup_rcu(), etc.  rcu_read_lock() grabbed by
path_init().
	* full_name_hash().  Fuckloads.
	* hashlen_string().  Fewer, but...
	* link_path_walk() -> hash_name().  rcu_read_lock() held by path_init().

And then there's siphash(), but that one AFAICS should never see those faults.

Hell knows...  I'm somewhat tempted to slap pagefault_disable()/pagefault_enable()
in dentry_string_cmp(), full_name_hash(), hashlen_string() and hash_name().
Regardless of the locks held by callers.  Doing that in load_unaligned_zeropad()
itself would be ridiculously costly, but these 4 would probably be saner...

I still would like to see the details of config, though.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Page fault while link_path_walk for path_len > 4060 bytes
       [not found]             ` <c4256a77288fd5f275a45bc73c9c78f0@codeaurora.org>
@ 2017-09-11  3:14               ` ankijain at codeaurora.org
  0 siblings, 0 replies; 11+ messages in thread
From: ankijain at codeaurora.org @ 2017-09-11  3:14 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Al Viro

Could you please reply on below query.

Are below error messages pointing to an issue which we can face later if 
we remove force panic?
http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7605
http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7608

Regards,
Ankit Jain

On 2017-08-30 22:49, ankijain at codeaurora.org wrote:
> Hi Al Viro
> 
> Thanks for replying.
> 
> We are using AOSP project tree.
> You can refer http://elixir.free-electrons.com/linux/v4.4.76/source.
> 
> http://elixir.free-electrons.com/linux/v4.4.76/source/arch/arm64/mm/fault.c#L302
>   (might_sleep())
> 
> http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7592
>  (___might_sleep())
> 
> Panic is added forcefully in our code after
> http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7625
> .
> 
> we have a query:
> Are below error messages pointing to an issue which we can face later
> if we remove force panic?
> http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7605
> http://elixir.free-electrons.com/linux/v4.4.76/source/kernel/sched/core.c#L7608
> 
> 
> we will retest after removing the force panic and update you if any
> issue occurs.
> config file is attached.
> 
> Regards,
> Ankit Jain
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation
> Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
> Linux Foundation Collaborative Project
> 
> On 2017-08-28 11:50, Al Viro wrote:
>> On Mon, Aug 28, 2017 at 09:53:00AM +0530, ankijain at codeaurora.org 
>> wrote:
>>> Hi Will Deacon/ Al viro
>>> 
>>> 
>>> -->Please find the attached kmsg.txt
>>> <3>[17620.275249] BUG: sleeping function called from invalid context 
>>> at 
>>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/arch/arm64/mm/fault.c:313
>>> <3>[17620.276504] in_atomic(): 0, irqs_disabled(): 0, pid: 10290, 
>>> name:
>>> stress-ng-dirde
>>> <6>[17620.298995] ------------[ cut here ]------------
>>> <2>[17620.299009] kernel BUG at 
>>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528!
>>> <6>[17620.306372] ------------[ cut here ]------------
>>> <2>[17620.327239] kernel BUG at 
>>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528!
>>> 
>>> 
>>> --> we are using arm64 machine with kernel 4.4.
>>> --> can you please guide us, how to capture ESR value while taking 
>>> the
>>> fault?
>>> -->
>>> -    { do_page_fault,    SIGSEGV, SEGV_MAPERR,    "level 3 
>>> translation
>>> fault"    },
>>> +    { do_translation_fault,    SIGSEGV, SEGV_MAPERR,    "level 3
>>> translation fault"    },
>>> we will try with above changes and get back to you.
>>> 
>>> -> config and kmsg are attached.
>>> 
>>> Regards,
>>> Ankit Jain
>>> Qualcomm India Private Limited, on behalf of Qualcomm Innovation
>>> Center, Inc.
>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
>>> Linux Foundation Collaborative Project
>> 
>> Umm...  Line numbers make no sense for 4.4.  Could you post a 
>> reference
>> to the actual tree used (repository + SHA1; again, it can't be vanilla
>> 4.4, or stable/linux-4.4.y, for that matter) as well as your .config?
>> 
>> In any case, looks like in_atomic() is false there, so we need an 
>> explicit
>> pagefault_disable() to make sure it goes to no_context.
>> 
>> Looking through the callchains...
>> 	* __d_lookup() -> d_same_name() -> dentry_cmp() -> 
>> dentry_string_cmp()
>> with rcu_read_lock() held by __d_lookup().
>> 	* d_alloc_parallel() -> d_same_name(), etc.  rcu_read_lock() held by
>> d_alloc_parallel() in one case, dentry->d_lock in another.
>> 	* d_exact_alias() -> d_same_name().  inode->i_lock held by 
>> d_exact_alias().
>> 	* d_alloc_parallel() -> __d_lookup_rcu() -> dentry_cmp().
>> rcu_read_lock() held by d_alloc_parallel().
>> 	* lookup_fast() -> __d_lookup_rcu(), etc.  rcu_read_lock() grabbed by
>> path_init().
>> 	* full_name_hash().  Fuckloads.
>> 	* hashlen_string().  Fewer, but...
>> 	* link_path_walk() -> hash_name().  rcu_read_lock() held by 
>> path_init().
>> 
>> And then there's siphash(), but that one AFAICS should never see those 
>> faults.
>> 
>> Hell knows...  I'm somewhat tempted to slap
>> pagefault_disable()/pagefault_enable()
>> in dentry_string_cmp(), full_name_hash(), hashlen_string() and 
>> hash_name().
>> Regardless of the locks held by callers.  Doing that in 
>> load_unaligned_zeropad()
>> itself would be ridiculously costly, but these 4 would probably be 
>> saner...
>> 
>> I still would like to see the details of config, though.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Page fault while link_path_walk for path_len > 4060 bytes
       [not found]         ` <b13b3a27e92c5413d168ad775163ea91@codeaurora.org>
  2017-08-28  6:20           ` Al Viro
@ 2017-09-12 20:26           ` Will Deacon
  2017-09-13 17:05             ` ankijain at codeaurora.org
  2017-11-04  0:17             ` Al Viro
  1 sibling, 2 replies; 11+ messages in thread
From: Will Deacon @ 2017-09-12 20:26 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Aug 28, 2017 at 09:53:00AM +0530, ankijain at codeaurora.org wrote:
> Hi Will Deacon/ Al viro
> 
> 
> -->Please find the attached kmsg.txt
> <3>[17620.275249] BUG: sleeping function called from invalid context at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/arch/arm64/mm/fault.c:313
> <3>[17620.276504] in_atomic(): 0, irqs_disabled(): 0, pid: 10290, name:
> stress-ng-dirde
> <6>[17620.298995] ------------[ cut here ]------------
> <2>[17620.299009] kernel BUG at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528!
> <6>[17620.306372] ------------[ cut here ]------------
> <2>[17620.327239] kernel BUG at /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528!
> 
> 
> --> we are using arm64 machine with kernel 4.4.
> --> can you please guide us, how to capture ESR value while taking the
> fault?
> -->
> -    { do_page_fault,    SIGSEGV, SEGV_MAPERR,    "level 3 translation
> fault"    },
> +    { do_translation_fault,    SIGSEGV, SEGV_MAPERR,    "level 3
> translation fault"    },
> we will try with above changes and get back to you.

Did you test with this change?

Will

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Page fault while link_path_walk for path_len > 4060 bytes
  2017-09-12 20:26           ` Will Deacon
@ 2017-09-13 17:05             ` ankijain at codeaurora.org
  2017-09-26 17:36               ` ankijain at codeaurora.org
  2017-11-04  0:17             ` Al Viro
  1 sibling, 1 reply; 11+ messages in thread
From: ankijain at codeaurora.org @ 2017-09-13 17:05 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will



>> -    { do_page_fault,    SIGSEGV, SEGV_MAPERR,    "level 3 translation
>> fault"    },
>> +    { do_translation_fault,    SIGSEGV, SEGV_MAPERR,    "level 3
>> translation fault"    },
>> we will try with above changes and get back to you.

we didn't try yet with above changes.
we will start the test soon and update you.

Regards,
Ankit Jain
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, 
Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project



On 2017-09-13 01:56, Will Deacon wrote:
> On Mon, Aug 28, 2017 at 09:53:00AM +0530, ankijain at codeaurora.org 
> wrote:
>> Hi Will Deacon/ Al viro
>> 
>> 
>> -->Please find the attached kmsg.txt
>> <3>[17620.275249] BUG: sleeping function called from invalid context 
>> at 
>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/arch/arm64/mm/fault.c:313
>> <3>[17620.276504] in_atomic(): 0, irqs_disabled(): 0, pid: 10290, 
>> name:
>> stress-ng-dirde
>> <6>[17620.298995] ------------[ cut here ]------------
>> <2>[17620.299009] kernel BUG at 
>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528!
>> <6>[17620.306372] ------------[ cut here ]------------
>> <2>[17620.327239] kernel BUG at 
>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528!
>> 
>> 
>> --> we are using arm64 machine with kernel 4.4.
>> --> can you please guide us, how to capture ESR value while taking the
>> fault?
>> -->
>> -    { do_page_fault,    SIGSEGV, SEGV_MAPERR,    "level 3 translation
>> fault"    },
>> +    { do_translation_fault,    SIGSEGV, SEGV_MAPERR,    "level 3
>> translation fault"    },
>> we will try with above changes and get back to you.
> 
> Did you test with this change?
> 
> Will

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Page fault while link_path_walk for path_len > 4060 bytes
  2017-09-13 17:05             ` ankijain at codeaurora.org
@ 2017-09-26 17:36               ` ankijain at codeaurora.org
  0 siblings, 0 replies; 11+ messages in thread
From: ankijain at codeaurora.org @ 2017-09-26 17:36 UTC (permalink / raw)
  To: linux-arm-kernel

Hi will

Unfortunately issue path is not getting hit with our released 
binary(without your chnage) also with your suggested changes.
so we cant confirm that your suggested changes are working or not.

We will come back to you once will face issue again.

Regards,
Ankit Jain

On 2017-09-13 22:35, ankijain at codeaurora.org wrote:
> Hi Will
> 
> 
> 
>>> -    { do_page_fault,    SIGSEGV, SEGV_MAPERR,    "level 3 
>>> translation
>>> fault"    },
>>> +    { do_translation_fault,    SIGSEGV, SEGV_MAPERR,    "level 3
>>> translation fault"    },
>>> we will try with above changes and get back to you.
> 
> we didn't try yet with above changes.
> we will start the test soon and update you.
> 
> Regards,
> Ankit Jain
> Qualcomm India Private Limited, on behalf of Qualcomm Innovation 
> Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
> Linux Foundation Collaborative Project
> 
> 
> 
> On 2017-09-13 01:56, Will Deacon wrote:
>> On Mon, Aug 28, 2017 at 09:53:00AM +0530, ankijain at codeaurora.org 
>> wrote:
>>> Hi Will Deacon/ Al viro
>>> 
>>> 
>>> -->Please find the attached kmsg.txt
>>> <3>[17620.275249] BUG: sleeping function called from invalid context 
>>> at 
>>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/arch/arm64/mm/fault.c:313
>>> <3>[17620.276504] in_atomic(): 0, irqs_disabled(): 0, pid: 10290, 
>>> name:
>>> stress-ng-dirde
>>> <6>[17620.298995] ------------[ cut here ]------------
>>> <2>[17620.299009] kernel BUG at 
>>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528!
>>> <6>[17620.306372] ------------[ cut here ]------------
>>> <2>[17620.327239] kernel BUG at 
>>> /local/mnt/workspace/lnxbuild/project/trees_in_use/free_tree_platform_manifest_refs_tags_AU_LINUX_ANDROID_LA.UM.5.7.07.01.01.287.725_sdm660_64_commander_26168534/checkout/kernel/msm-4.4/kernel/sched/core.c:8528!
>>> 
>>> 
>>> --> we are using arm64 machine with kernel 4.4.
>>> --> can you please guide us, how to capture ESR value while taking 
>>> the
>>> fault?
>>> -->
>>> -    { do_page_fault,    SIGSEGV, SEGV_MAPERR,    "level 3 
>>> translation
>>> fault"    },
>>> +    { do_translation_fault,    SIGSEGV, SEGV_MAPERR,    "level 3
>>> translation fault"    },
>>> we will try with above changes and get back to you.
>> 
>> Did you test with this change?
>> 
>> Will

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Page fault while link_path_walk for path_len > 4060 bytes
  2017-09-12 20:26           ` Will Deacon
  2017-09-13 17:05             ` ankijain at codeaurora.org
@ 2017-11-04  0:17             ` Al Viro
  2017-11-07  1:19               ` Will Deacon
  1 sibling, 1 reply; 11+ messages in thread
From: Al Viro @ 2017-11-04  0:17 UTC (permalink / raw)
  To: linux-arm-kernel

[looking through the old mail]

On Tue, Sep 12, 2017 at 09:26:16PM +0100, Will Deacon wrote:
> > -    { do_page_fault,    SIGSEGV, SEGV_MAPERR,    "level 3 translation
> > fault"    },
> > +    { do_translation_fault,    SIGSEGV, SEGV_MAPERR,    "level 3
> > translation fault"    },
> > we will try with above changes and get back to you.
> 
> Did you test with this change?

FWIW, while that went in as commit 760bfb47c36a ("arm64: fault: Route pte
translation faults via do_translation_fault"), I wonder if the same issue
exists on arm.  It looks like the pagefault handler there is fairly
similar to arm64 one and the same shortcut is present there.

The more I'm looking at that, the more it looks like we *really* need
a comment in all instances of load_unaligned_zeropad() warning about
that pitfall.  Something like
/*
 * Load an unaligned word from kernel space.
 *
 * In the (very unlikely) case of the word being a page-crosser
 * and the next page not being mapped, take the exception and
 * return zeroes in the non-existing part.
 *
 * NOTE: this relies upon the pagefault handler *NOT* blocking
 * in such situation (fault in kernel mode on kernel address with
 * exception fixup present).  Verify that for your architecture
 * before using an equivalent of this approach.  Note that
 * you can't count upon faulthandler_disabled() saving you;
 * this function can be called e.g. under a spinlock on non-preempt
 * kernels without pagefault_disable() done by caller.
 */
perhaps.  That property holds on x86, ppc and (now) arm64, but as
arm64 case shows, it might not be true for other architectures.
As the matter of fact, e.g. sparc64 (which will not use that
thing for obvious reasons anyway) it is *not* true, etc.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Page fault while link_path_walk for path_len > 4060 bytes
  2017-11-04  0:17             ` Al Viro
@ 2017-11-07  1:19               ` Will Deacon
  0 siblings, 0 replies; 11+ messages in thread
From: Will Deacon @ 2017-11-07  1:19 UTC (permalink / raw)
  To: linux-arm-kernel

On Sat, Nov 04, 2017 at 12:17:57AM +0000, Al Viro wrote:
> [looking through the old mail]
> 
> On Tue, Sep 12, 2017 at 09:26:16PM +0100, Will Deacon wrote:
> > > -    { do_page_fault,    SIGSEGV, SEGV_MAPERR,    "level 3 translation
> > > fault"    },
> > > +    { do_translation_fault,    SIGSEGV, SEGV_MAPERR,    "level 3
> > > translation fault"    },
> > > we will try with above changes and get back to you.
> > 
> > Did you test with this change?
> 
> FWIW, while that went in as commit 760bfb47c36a ("arm64: fault: Route pte
> translation faults via do_translation_fault"), I wonder if the same issue
> exists on arm.  It looks like the pagefault handler there is fairly
> similar to arm64 one and the same shortcut is present there.

So we never actually nailed this one down on arm64 (the problem mysteriously
disappeared iirc, and I failed to reproduce it on my systems) but I fixed it
anyway because it looked at least theoretically possible. In that case, yes,
I think ARM needs the same sort of fixes for both 2 and 3 level tables where
the handler branch directly to do_page_fault.

> The more I'm looking at that, the more it looks like we *really* need
> a comment in all instances of load_unaligned_zeropad() warning about
> that pitfall.  Something like
> /*
>  * Load an unaligned word from kernel space.
>  *
>  * In the (very unlikely) case of the word being a page-crosser
>  * and the next page not being mapped, take the exception and
>  * return zeroes in the non-existing part.
>  *
>  * NOTE: this relies upon the pagefault handler *NOT* blocking
>  * in such situation (fault in kernel mode on kernel address with
>  * exception fixup present).  Verify that for your architecture
>  * before using an equivalent of this approach.  Note that
>  * you can't count upon faulthandler_disabled() saving you;
>  * this function can be called e.g. under a spinlock on non-preempt
>  * kernels without pagefault_disable() done by caller.
>  */
> perhaps.  That property holds on x86, ppc and (now) arm64, but as
> arm64 case shows, it might not be true for other architectures.
> As the matter of fact, e.g. sparc64 (which will not use that
> thing for obvious reasons anyway) it is *not* true, etc.

I wonder if there's any mileage in a test module for this? That code rarely
(if ever) gets run in practice.

Will

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-11-07  1:19 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-07  6:22 Page fault while link_path_walk for path_len > 4060 bytes ankijain
2017-08-07 12:18 ` ankijain
2017-08-14 11:14   ` ankijain
     [not found]     ` <953068e79da559bfd4f13e46e31c5a4e@codeaurora.org>
2017-08-22 12:57       ` Will Deacon
     [not found]         ` <b13b3a27e92c5413d168ad775163ea91@codeaurora.org>
2017-08-28  6:20           ` Al Viro
     [not found]             ` <c4256a77288fd5f275a45bc73c9c78f0@codeaurora.org>
2017-09-11  3:14               ` ankijain at codeaurora.org
2017-09-12 20:26           ` Will Deacon
2017-09-13 17:05             ` ankijain at codeaurora.org
2017-09-26 17:36               ` ankijain at codeaurora.org
2017-11-04  0:17             ` Al Viro
2017-11-07  1:19               ` Will Deacon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.