All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] reconnect_one(): fix a missing error code
@ 2017-06-14  9:30 ` Dan Carpenter
  0 siblings, 0 replies; 14+ messages in thread
From: Dan Carpenter @ 2017-06-14  9:30 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: David Howells, NeilBrown, Al Viro, Ingo Molnar, linux-kernel,
	kernel-janitors

I found this bug by reviewing places where we do ERR_PTR(0) (which is
NULL).

We used to return an error pointer if lookup_one_len() failed but we
moved this code into a helper function and accidentally removed that.
NULL is a valid return for this function but it's not what we intended.

Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 329a5d103846..451237745689 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -147,6 +147,7 @@ static struct dentry *reconnect_one(struct vfsmount *mnt,
 	tmp = lookup_one_len_unlocked(nbuf, parent, strlen(nbuf));
 	if (IS_ERR(tmp)) {
 		dprintk("%s: lookup failed: %d\n", __func__, PTR_ERR(tmp));
+		err = PTR_ERR(tmp);
 		goto out_err;
 	}
 	if (tmp != dentry) {

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH] reconnect_one(): fix a missing error code
@ 2017-06-14  9:30 ` Dan Carpenter
  0 siblings, 0 replies; 14+ messages in thread
From: Dan Carpenter @ 2017-06-14  9:30 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: David Howells, NeilBrown, Al Viro, Ingo Molnar, linux-kernel,
	kernel-janitors

I found this bug by reviewing places where we do ERR_PTR(0) (which is
NULL).

We used to return an error pointer if lookup_one_len() failed but we
moved this code into a helper function and accidentally removed that.
NULL is a valid return for this function but it's not what we intended.

Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 329a5d103846..451237745689 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -147,6 +147,7 @@ static struct dentry *reconnect_one(struct vfsmount *mnt,
 	tmp = lookup_one_len_unlocked(nbuf, parent, strlen(nbuf));
 	if (IS_ERR(tmp)) {
 		dprintk("%s: lookup failed: %d\n", __func__, PTR_ERR(tmp));
+		err = PTR_ERR(tmp);
 		goto out_err;
 	}
 	if (tmp != dentry) {

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
  2017-06-14  9:30 ` Dan Carpenter
@ 2017-06-14 20:34   ` J. Bruce Fields
  -1 siblings, 0 replies; 14+ messages in thread
From: J. Bruce Fields @ 2017-06-14 20:34 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: J. Bruce Fields, David Howells, NeilBrown, Al Viro, Ingo Molnar,
	linux-kernel, kernel-janitors

On Wed, Jun 14, 2017 at 12:30:02PM +0300, Dan Carpenter wrote:
> I found this bug by reviewing places where we do ERR_PTR(0) (which is
> NULL).
> 
> We used to return an error pointer if lookup_one_len() failed but we
> moved this code into a helper function and accidentally removed that.
> NULL is a valid return for this function but it's not what we intended.
> 
> Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

ACK.  Agreed that the current code is wrong, and that this is the
correct fix.

What I don't quite understand yet is what the impact of the bug would
be.

--b.

> 
> diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
> index 329a5d103846..451237745689 100644
> --- a/fs/exportfs/expfs.c
> +++ b/fs/exportfs/expfs.c
> @@ -147,6 +147,7 @@ static struct dentry *reconnect_one(struct vfsmount *mnt,
>  	tmp = lookup_one_len_unlocked(nbuf, parent, strlen(nbuf));
>  	if (IS_ERR(tmp)) {
>  		dprintk("%s: lookup failed: %d\n", __func__, PTR_ERR(tmp));
> +		err = PTR_ERR(tmp);
>  		goto out_err;
>  	}
>  	if (tmp != dentry) {

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
@ 2017-06-14 20:34   ` J. Bruce Fields
  0 siblings, 0 replies; 14+ messages in thread
From: J. Bruce Fields @ 2017-06-14 20:34 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: J. Bruce Fields, David Howells, NeilBrown, Al Viro, Ingo Molnar,
	linux-kernel, kernel-janitors

On Wed, Jun 14, 2017 at 12:30:02PM +0300, Dan Carpenter wrote:
> I found this bug by reviewing places where we do ERR_PTR(0) (which is
> NULL).
> 
> We used to return an error pointer if lookup_one_len() failed but we
> moved this code into a helper function and accidentally removed that.
> NULL is a valid return for this function but it's not what we intended.
> 
> Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

ACK.  Agreed that the current code is wrong, and that this is the
correct fix.

What I don't quite understand yet is what the impact of the bug would
be.

--b.

> 
> diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
> index 329a5d103846..451237745689 100644
> --- a/fs/exportfs/expfs.c
> +++ b/fs/exportfs/expfs.c
> @@ -147,6 +147,7 @@ static struct dentry *reconnect_one(struct vfsmount *mnt,
>  	tmp = lookup_one_len_unlocked(nbuf, parent, strlen(nbuf));
>  	if (IS_ERR(tmp)) {
>  		dprintk("%s: lookup failed: %d\n", __func__, PTR_ERR(tmp));
> +		err = PTR_ERR(tmp);
>  		goto out_err;
>  	}
>  	if (tmp != dentry) {

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
  2017-06-14 20:34   ` J. Bruce Fields
@ 2017-06-14 21:54     ` NeilBrown
  -1 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2017-06-14 21:54 UTC (permalink / raw)
  To: J. Bruce Fields, Dan Carpenter
  Cc: J. Bruce Fields, David Howells, Al Viro, Ingo Molnar,
	linux-kernel, kernel-janitors

[-- Attachment #1: Type: text/plain, Size: 2281 bytes --]

On Wed, Jun 14 2017, J. Bruce Fields wrote:

> On Wed, Jun 14, 2017 at 12:30:02PM +0300, Dan Carpenter wrote:
>> I found this bug by reviewing places where we do ERR_PTR(0) (which is
>> NULL).
>> 
>> We used to return an error pointer if lookup_one_len() failed but we
>> moved this code into a helper function and accidentally removed that.
>> NULL is a valid return for this function but it's not what we intended.
>> 
>> Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
>> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
>
> ACK.  Agreed that the current code is wrong, and that this is the
> correct fix.
>
> What I don't quite understand yet is what the impact of the bug would
> be.
>

It is interesting that reconnect_path() handles the possibility of
reconnect_one() returning NULL, even though it will only do that if this
"bug" is triggered.
When that happens, the target_dir (a descendent of dentry) gets its
DCACHE_DISCONNECTED flag cleared.

The bug can presumably only be triggered by a race.
We look through a directory to find the name for an  inode
(exportfs_get_name), then try to look up that name and it doesn't exist.

So presumably if you lose the race, some dentry will get
DCACHE_DISCONNECTED cleared, even though it is still disconnected.
This breaks a contract and can cause weirdness in dcache operations.

If the lookup_one_len_unlocked() fails, we should probably retry, at
least once.  But if we do decide to give up, we shouldn't assume it all
worked.

So I suggest:
 - the fix as provided by Dan, plus
 - remove "if (!parent) break;" from reconnect_path(), plus
 - maybe retry the get_name/lookup_one operation once if the first
    attempt fails.

NeilBrown


> --b.
>
>> 
>> diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
>> index 329a5d103846..451237745689 100644
>> --- a/fs/exportfs/expfs.c
>> +++ b/fs/exportfs/expfs.c
>> @@ -147,6 +147,7 @@ static struct dentry *reconnect_one(struct vfsmount *mnt,
>>  	tmp = lookup_one_len_unlocked(nbuf, parent, strlen(nbuf));
>>  	if (IS_ERR(tmp)) {
>>  		dprintk("%s: lookup failed: %d\n", __func__, PTR_ERR(tmp));
>> +		err = PTR_ERR(tmp);
>>  		goto out_err;
>>  	}
>>  	if (tmp != dentry) {

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
@ 2017-06-14 21:54     ` NeilBrown
  0 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2017-06-14 21:54 UTC (permalink / raw)
  To: J. Bruce Fields, Dan Carpenter
  Cc: J. Bruce Fields, David Howells, Al Viro, Ingo Molnar,
	linux-kernel, kernel-janitors

[-- Attachment #1: Type: text/plain, Size: 2281 bytes --]

On Wed, Jun 14 2017, J. Bruce Fields wrote:

> On Wed, Jun 14, 2017 at 12:30:02PM +0300, Dan Carpenter wrote:
>> I found this bug by reviewing places where we do ERR_PTR(0) (which is
>> NULL).
>> 
>> We used to return an error pointer if lookup_one_len() failed but we
>> moved this code into a helper function and accidentally removed that.
>> NULL is a valid return for this function but it's not what we intended.
>> 
>> Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
>> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
>
> ACK.  Agreed that the current code is wrong, and that this is the
> correct fix.
>
> What I don't quite understand yet is what the impact of the bug would
> be.
>

It is interesting that reconnect_path() handles the possibility of
reconnect_one() returning NULL, even though it will only do that if this
"bug" is triggered.
When that happens, the target_dir (a descendent of dentry) gets its
DCACHE_DISCONNECTED flag cleared.

The bug can presumably only be triggered by a race.
We look through a directory to find the name for an  inode
(exportfs_get_name), then try to look up that name and it doesn't exist.

So presumably if you lose the race, some dentry will get
DCACHE_DISCONNECTED cleared, even though it is still disconnected.
This breaks a contract and can cause weirdness in dcache operations.

If the lookup_one_len_unlocked() fails, we should probably retry, at
least once.  But if we do decide to give up, we shouldn't assume it all
worked.

So I suggest:
 - the fix as provided by Dan, plus
 - remove "if (!parent) break;" from reconnect_path(), plus
 - maybe retry the get_name/lookup_one operation once if the first
    attempt fails.

NeilBrown


> --b.
>
>> 
>> diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
>> index 329a5d103846..451237745689 100644
>> --- a/fs/exportfs/expfs.c
>> +++ b/fs/exportfs/expfs.c
>> @@ -147,6 +147,7 @@ static struct dentry *reconnect_one(struct vfsmount *mnt,
>>  	tmp = lookup_one_len_unlocked(nbuf, parent, strlen(nbuf));
>>  	if (IS_ERR(tmp)) {
>>  		dprintk("%s: lookup failed: %d\n", __func__, PTR_ERR(tmp));
>> +		err = PTR_ERR(tmp);
>>  		goto out_err;
>>  	}
>>  	if (tmp != dentry) {

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
  2017-06-14 21:54     ` NeilBrown
@ 2017-06-15  9:26       ` Dan Carpenter
  -1 siblings, 0 replies; 14+ messages in thread
From: Dan Carpenter @ 2017-06-15  9:26 UTC (permalink / raw)
  To: NeilBrown
  Cc: J. Bruce Fields, J. Bruce Fields, David Howells, Al Viro,
	Ingo Molnar, linux-kernel, kernel-janitors

On Thu, Jun 15, 2017 at 07:54:57AM +1000, NeilBrown wrote:
> On Wed, Jun 14 2017, J. Bruce Fields wrote:
> 
> > On Wed, Jun 14, 2017 at 12:30:02PM +0300, Dan Carpenter wrote:
> >> I found this bug by reviewing places where we do ERR_PTR(0) (which is
> >> NULL).
> >> 
> >> We used to return an error pointer if lookup_one_len() failed but we
> >> moved this code into a helper function and accidentally removed that.
> >> NULL is a valid return for this function but it's not what we intended.
> >> 
> >> Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
> >> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> >
> > ACK.  Agreed that the current code is wrong, and that this is the
> > correct fix.
> >
> > What I don't quite understand yet is what the impact of the bug would
> > be.
> >
> 
> It is interesting that reconnect_path() handles the possibility of
> reconnect_one() returning NULL, even though it will only do that if this
> "bug" is triggered.

No, we return NULL for the goto out_reconnected case.

regards,
dan carpenter

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
@ 2017-06-15  9:26       ` Dan Carpenter
  0 siblings, 0 replies; 14+ messages in thread
From: Dan Carpenter @ 2017-06-15  9:26 UTC (permalink / raw)
  To: NeilBrown
  Cc: J. Bruce Fields, J. Bruce Fields, David Howells, Al Viro,
	Ingo Molnar, linux-kernel, kernel-janitors

On Thu, Jun 15, 2017 at 07:54:57AM +1000, NeilBrown wrote:
> On Wed, Jun 14 2017, J. Bruce Fields wrote:
> 
> > On Wed, Jun 14, 2017 at 12:30:02PM +0300, Dan Carpenter wrote:
> >> I found this bug by reviewing places where we do ERR_PTR(0) (which is
> >> NULL).
> >> 
> >> We used to return an error pointer if lookup_one_len() failed but we
> >> moved this code into a helper function and accidentally removed that.
> >> NULL is a valid return for this function but it's not what we intended.
> >> 
> >> Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
> >> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> >
> > ACK.  Agreed that the current code is wrong, and that this is the
> > correct fix.
> >
> > What I don't quite understand yet is what the impact of the bug would
> > be.
> >
> 
> It is interesting that reconnect_path() handles the possibility of
> reconnect_one() returning NULL, even though it will only do that if this
> "bug" is triggered.

No, we return NULL for the goto out_reconnected case.

regards,
dan carpenter



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
  2017-06-14 21:54     ` NeilBrown
@ 2017-06-15 21:40       ` J. Bruce Fields
  -1 siblings, 0 replies; 14+ messages in thread
From: J. Bruce Fields @ 2017-06-15 21:40 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dan Carpenter, J. Bruce Fields, David Howells, Al Viro,
	Ingo Molnar, linux-kernel, kernel-janitors

On Thu, Jun 15, 2017 at 07:54:57AM +1000, NeilBrown wrote:
> On Wed, Jun 14 2017, J. Bruce Fields wrote:
> 
> > On Wed, Jun 14, 2017 at 12:30:02PM +0300, Dan Carpenter wrote:
> >> I found this bug by reviewing places where we do ERR_PTR(0) (which is
> >> NULL).
> >> 
> >> We used to return an error pointer if lookup_one_len() failed but we
> >> moved this code into a helper function and accidentally removed that.
> >> NULL is a valid return for this function but it's not what we intended.
> >> 
> >> Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
> >> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> >
> > ACK.  Agreed that the current code is wrong, and that this is the
> > correct fix.
> >
> > What I don't quite understand yet is what the impact of the bug would
> > be.
> >
> 
> It is interesting that reconnect_path() handles the possibility of
> reconnect_one() returning NULL, even though it will only do that if this
> "bug" is triggered.

As Dan says, you're missing a case.

> When that happens, the target_dir (a descendent of dentry) gets its
> DCACHE_DISCONNECTED flag cleared.
> 
> The bug can presumably only be triggered by a race.
> We look through a directory to find the name for an  inode
> (exportfs_get_name), then try to look up that name and it doesn't exist.

Wouldn't lookup_one_len succesfully return a negative dentry in that
case?

I think the error cases here are more likely due to permissions or IO
errors.

So, I wonder if you can get some kind of dcache corruption with an
uncached lookup of a directory with an ancestor that we lack permission
to.

> So presumably if you lose the race, some dentry will get
> DCACHE_DISCONNECTED cleared, even though it is still disconnected.
> This breaks a contract and can cause weirdness in dcache operations.
> 
> If the lookup_one_len_unlocked() fails, we should probably retry, at
> least once.  But if we do decide to give up, we shouldn't assume it all
> worked.
> 
> So I suggest:
>  - the fix as provided by Dan, plus
>  - remove "if (!parent) break;" from reconnect_path(), plus
>  - maybe retry the get_name/lookup_one operation once if the first
>     attempt fails.

See the comments in the code--if we lose the race, then it's because of
a concurrent operation which should have done the reconnection for us.

--b.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
@ 2017-06-15 21:40       ` J. Bruce Fields
  0 siblings, 0 replies; 14+ messages in thread
From: J. Bruce Fields @ 2017-06-15 21:40 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dan Carpenter, J. Bruce Fields, David Howells, Al Viro,
	Ingo Molnar, linux-kernel, kernel-janitors

On Thu, Jun 15, 2017 at 07:54:57AM +1000, NeilBrown wrote:
> On Wed, Jun 14 2017, J. Bruce Fields wrote:
> 
> > On Wed, Jun 14, 2017 at 12:30:02PM +0300, Dan Carpenter wrote:
> >> I found this bug by reviewing places where we do ERR_PTR(0) (which is
> >> NULL).
> >> 
> >> We used to return an error pointer if lookup_one_len() failed but we
> >> moved this code into a helper function and accidentally removed that.
> >> NULL is a valid return for this function but it's not what we intended.
> >> 
> >> Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
> >> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
> >
> > ACK.  Agreed that the current code is wrong, and that this is the
> > correct fix.
> >
> > What I don't quite understand yet is what the impact of the bug would
> > be.
> >
> 
> It is interesting that reconnect_path() handles the possibility of
> reconnect_one() returning NULL, even though it will only do that if this
> "bug" is triggered.

As Dan says, you're missing a case.

> When that happens, the target_dir (a descendent of dentry) gets its
> DCACHE_DISCONNECTED flag cleared.
> 
> The bug can presumably only be triggered by a race.
> We look through a directory to find the name for an  inode
> (exportfs_get_name), then try to look up that name and it doesn't exist.

Wouldn't lookup_one_len succesfully return a negative dentry in that
case?

I think the error cases here are more likely due to permissions or IO
errors.

So, I wonder if you can get some kind of dcache corruption with an
uncached lookup of a directory with an ancestor that we lack permission
to.

> So presumably if you lose the race, some dentry will get
> DCACHE_DISCONNECTED cleared, even though it is still disconnected.
> This breaks a contract and can cause weirdness in dcache operations.
> 
> If the lookup_one_len_unlocked() fails, we should probably retry, at
> least once.  But if we do decide to give up, we shouldn't assume it all
> worked.
> 
> So I suggest:
>  - the fix as provided by Dan, plus
>  - remove "if (!parent) break;" from reconnect_path(), plus
>  - maybe retry the get_name/lookup_one operation once if the first
>     attempt fails.

See the comments in the code--if we lose the race, then it's because of
a concurrent operation which should have done the reconnection for us.

--b.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
  2017-06-15 21:40       ` J. Bruce Fields
@ 2017-06-15 22:28         ` NeilBrown
  -1 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2017-06-15 22:28 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Dan Carpenter, J. Bruce Fields, David Howells, Al Viro,
	Ingo Molnar, linux-kernel, kernel-janitors

[-- Attachment #1: Type: text/plain, Size: 3111 bytes --]

On Thu, Jun 15 2017, J. Bruce Fields wrote:

> On Thu, Jun 15, 2017 at 07:54:57AM +1000, NeilBrown wrote:
>> On Wed, Jun 14 2017, J. Bruce Fields wrote:
>> 
>> > On Wed, Jun 14, 2017 at 12:30:02PM +0300, Dan Carpenter wrote:
>> >> I found this bug by reviewing places where we do ERR_PTR(0) (which is
>> >> NULL).
>> >> 
>> >> We used to return an error pointer if lookup_one_len() failed but we
>> >> moved this code into a helper function and accidentally removed that.
>> >> NULL is a valid return for this function but it's not what we intended.
>> >> 
>> >> Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
>> >> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
>> >
>> > ACK.  Agreed that the current code is wrong, and that this is the
>> > correct fix.
>> >
>> > What I don't quite understand yet is what the impact of the bug would
>> > be.
>> >
>> 
>> It is interesting that reconnect_path() handles the possibility of
>> reconnect_one() returning NULL, even though it will only do that if this
>> "bug" is triggered.
>
> As Dan says, you're missing a case.

:-(

>
>> When that happens, the target_dir (a descendent of dentry) gets its
>> DCACHE_DISCONNECTED flag cleared.
>> 
>> The bug can presumably only be triggered by a race.
>> We look through a directory to find the name for an  inode
>> (exportfs_get_name), then try to look up that name and it doesn't exist.
>
> Wouldn't lookup_one_len succesfully return a negative dentry in that
> case?

Uhm, yes.  That case has a nice big comment in the "tmp != dentry" case
too.

>
> I think the error cases here are more likely due to permissions or IO
> errors.
>
> So, I wonder if you can get some kind of dcache corruption with an
> uncached lookup of a directory with an ancestor that we lack permission
> to.

It wouldn't be too hard to create that scenario.  It might be harder to
find a way to expose the corruption.

I think you should trigger the WARN_ON_ONCE() in clear_disconnected() if
you manage to trigger the bug.

The main reason that it is dangerous to have disconnected dentries
around is for the loop detection on directory renames.
You might be able to confuse the locking logic in there if one of
the directories isn't connected to the root.

Thanks,
NeilBrown


>
>> So presumably if you lose the race, some dentry will get
>> DCACHE_DISCONNECTED cleared, even though it is still disconnected.
>> This breaks a contract and can cause weirdness in dcache operations.
>> 
>> If the lookup_one_len_unlocked() fails, we should probably retry, at
>> least once.  But if we do decide to give up, we shouldn't assume it all
>> worked.
>> 
>> So I suggest:
>>  - the fix as provided by Dan, plus
>>  - remove "if (!parent) break;" from reconnect_path(), plus
>>  - maybe retry the get_name/lookup_one operation once if the first
>>     attempt fails.
>
> See the comments in the code--if we lose the race, then it's because of
> a concurrent operation which should have done the reconnection for us.
>
> --b.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
@ 2017-06-15 22:28         ` NeilBrown
  0 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2017-06-15 22:28 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Dan Carpenter, J. Bruce Fields, David Howells, Al Viro,
	Ingo Molnar, linux-kernel, kernel-janitors

[-- Attachment #1: Type: text/plain, Size: 3111 bytes --]

On Thu, Jun 15 2017, J. Bruce Fields wrote:

> On Thu, Jun 15, 2017 at 07:54:57AM +1000, NeilBrown wrote:
>> On Wed, Jun 14 2017, J. Bruce Fields wrote:
>> 
>> > On Wed, Jun 14, 2017 at 12:30:02PM +0300, Dan Carpenter wrote:
>> >> I found this bug by reviewing places where we do ERR_PTR(0) (which is
>> >> NULL).
>> >> 
>> >> We used to return an error pointer if lookup_one_len() failed but we
>> >> moved this code into a helper function and accidentally removed that.
>> >> NULL is a valid return for this function but it's not what we intended.
>> >> 
>> >> Fixes: bbf7a8a3562f ("exportfs: move most of reconnect_path to helper function")
>> >> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
>> >
>> > ACK.  Agreed that the current code is wrong, and that this is the
>> > correct fix.
>> >
>> > What I don't quite understand yet is what the impact of the bug would
>> > be.
>> >
>> 
>> It is interesting that reconnect_path() handles the possibility of
>> reconnect_one() returning NULL, even though it will only do that if this
>> "bug" is triggered.
>
> As Dan says, you're missing a case.

:-(

>
>> When that happens, the target_dir (a descendent of dentry) gets its
>> DCACHE_DISCONNECTED flag cleared.
>> 
>> The bug can presumably only be triggered by a race.
>> We look through a directory to find the name for an  inode
>> (exportfs_get_name), then try to look up that name and it doesn't exist.
>
> Wouldn't lookup_one_len succesfully return a negative dentry in that
> case?

Uhm, yes.  That case has a nice big comment in the "tmp != dentry" case
too.

>
> I think the error cases here are more likely due to permissions or IO
> errors.
>
> So, I wonder if you can get some kind of dcache corruption with an
> uncached lookup of a directory with an ancestor that we lack permission
> to.

It wouldn't be too hard to create that scenario.  It might be harder to
find a way to expose the corruption.

I think you should trigger the WARN_ON_ONCE() in clear_disconnected() if
you manage to trigger the bug.

The main reason that it is dangerous to have disconnected dentries
around is for the loop detection on directory renames.
You might be able to confuse the locking logic in there if one of
the directories isn't connected to the root.

Thanks,
NeilBrown


>
>> So presumably if you lose the race, some dentry will get
>> DCACHE_DISCONNECTED cleared, even though it is still disconnected.
>> This breaks a contract and can cause weirdness in dcache operations.
>> 
>> If the lookup_one_len_unlocked() fails, we should probably retry, at
>> least once.  But if we do decide to give up, we shouldn't assume it all
>> worked.
>> 
>> So I suggest:
>>  - the fix as provided by Dan, plus
>>  - remove "if (!parent) break;" from reconnect_path(), plus
>>  - maybe retry the get_name/lookup_one operation once if the first
>>     attempt fails.
>
> See the comments in the code--if we lose the race, then it's because of
> a concurrent operation which should have done the reconnection for us.
>
> --b.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
  2017-06-15 22:28         ` NeilBrown
@ 2017-06-16 13:50           ` J. Bruce Fields
  -1 siblings, 0 replies; 14+ messages in thread
From: J. Bruce Fields @ 2017-06-16 13:50 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dan Carpenter, J. Bruce Fields, David Howells, Al Viro,
	Ingo Molnar, linux-kernel, kernel-janitors

On Fri, Jun 16, 2017 at 08:28:31AM +1000, NeilBrown wrote:
> It wouldn't be too hard to create that scenario.  It might be harder to
> find a way to expose the corruption.

Yes.  Well, in any case, I assume this Al's to take.  ACK to the patch,
and it should go to stable as well.

--b.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] reconnect_one(): fix a missing error code
@ 2017-06-16 13:50           ` J. Bruce Fields
  0 siblings, 0 replies; 14+ messages in thread
From: J. Bruce Fields @ 2017-06-16 13:50 UTC (permalink / raw)
  To: NeilBrown
  Cc: Dan Carpenter, J. Bruce Fields, David Howells, Al Viro,
	Ingo Molnar, linux-kernel, kernel-janitors

On Fri, Jun 16, 2017 at 08:28:31AM +1000, NeilBrown wrote:
> It wouldn't be too hard to create that scenario.  It might be harder to
> find a way to expose the corruption.

Yes.  Well, in any case, I assume this Al's to take.  ACK to the patch,
and it should go to stable as well.

--b.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-06-16 13:50 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-14  9:30 [PATCH] reconnect_one(): fix a missing error code Dan Carpenter
2017-06-14  9:30 ` Dan Carpenter
2017-06-14 20:34 ` J. Bruce Fields
2017-06-14 20:34   ` J. Bruce Fields
2017-06-14 21:54   ` NeilBrown
2017-06-14 21:54     ` NeilBrown
2017-06-15  9:26     ` Dan Carpenter
2017-06-15  9:26       ` Dan Carpenter
2017-06-15 21:40     ` J. Bruce Fields
2017-06-15 21:40       ` J. Bruce Fields
2017-06-15 22:28       ` NeilBrown
2017-06-15 22:28         ` NeilBrown
2017-06-16 13:50         ` J. Bruce Fields
2017-06-16 13:50           ` J. Bruce Fields

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.