All of lore.kernel.org
 help / color / mirror / Atom feed
* Fwd: w1/slaves/w1_therm: null-ptr access of sl->family_data
       [not found]   ` <CA+1k2cEwtY+U7TS3rpmMp5nEBokO8vwhcOiD0ExVQnuoB=XVLQ@mail.gmail.com>
@ 2015-02-23 17:09     ` Thorsten Bschorr
  2015-02-24  1:37       ` David Fries
  0 siblings, 1 reply; 24+ messages in thread
From: Thorsten Bschorr @ 2015-02-23 17:09 UTC (permalink / raw)
  To: linux-kernel; +Cc: Evgeniy Polyakov, David Fries

Hi,

I have observed a null-pointer access in w1/slaves/w1_therm on my Raspberry
Pi running 3.18.5 with several DS18S20 temperature sensors. I have already
discussed the problem with Evgeniy.

@Evgeniy & David: sorry for re-sending my message, my email
accidentally contained a HTML part.



w1_therm: w1_slave_show checks if the sensor uses external power. If this
is the case, the mutex on dev->bus_mutex is unlocked while waiting 750 ms
for the sensor to convert the temperature. Before reading the temperature,
the mutex is again locked.

During this sleep-time, the sensor could get detached, for example by w1.c:
w1_search_process_cb not finding the sensor (*):
 !test_bit(W1_SLAVE_ACTIVE, &sl->flags) ==true.
This triggers w1_slave_detach, and hence w1_therm_remove_slave frees and
nulls sl->family_data.

w1_slave_show does not check if familiy_data is null after re-acquiring the
bus_mutex resulting in a null-ptr access when storing data.

After I've added checks for family_data!=0, I did not observe any more
crashes; the other data of struct w1_slave seem to be valid as long as any
thread executes w1_slave_show.


I have added debug-output to w1.c and w1_therm.c, here's a log of a
potential crash:

[184199.510227] w1_master_driver w1_bus_master2: w1_search_process_cb,
!W1_SLAVE_ACTIVE, calling w1_slave_detach
[184199.510276] w1_slave_driver 10-000802d9c9e4: w1_slave_detach
destroy_now 1
[184199.510297] w1_slave_driver 10-000802d9c9e4: w1_unref_slave refcnt 0
[184199.510321] w1_slave_driver 10-000802d9c9e4: w1_unref_slave: detaching
10-000802d9c9e4 [d5fb8800].
[184199.510347] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
w1_family_notify
[184199.510365] w1_slave_driver 10-000802d9c9e4: w1_family_notify calling
remove_slave
[184199.510382] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
[184199.510400] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
refcnt -1
[184200.049745] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
sleep), family_data==0
[184200.137133] w1_slave_driver 10-000802d9c9e4: w1_slave_show (before
sleep), family_data==0
[184200.889551] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
sleep), family_data==0
[184200.930866] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
sleep), family_data==0
[184200.930907] w1_slave_driver 10-000802d9c9e4: Read failed CRC check
[184200.931002] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
device_unregister
[184200.931169] w1 w1_unref_slave -> kfree

Note: When this crash happened, multiple threads were reading the sensor.


I could trigger the problem several times, and each time device_unregister
in w1_unref_slave was executed  *after*  w1_slave_show. In one case, the
logged time-difference between the first family_data==0 message and
device_unregister was about 8 seconds!
I have not observed a w1_slave_show call after w1_unref_slave (as long as
the device was not re-attached again).

>From my observation, the w1_slave data seem to be valid as long as
w1_slave_show is executed.
My guess is that the call of sysfs_remove_groups in w1_family_notify hits a
mutex (I did not add debug output here).


(*) On my tiny raspberry, this happens from time to time with high CPU and
external disc load (timing and/or electrical issues); it seems that the
sensor does not respond in time to the (periodic) search.


Please email me if you need further information.



Best regards,
Thorsten Bschorr

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Fwd: w1/slaves/w1_therm: null-ptr access of sl->family_data
  2015-02-23 17:09     ` Fwd: w1/slaves/w1_therm: null-ptr access of sl->family_data Thorsten Bschorr
@ 2015-02-24  1:37       ` David Fries
  2015-02-25  9:28         ` Thorsten Bschorr
  0 siblings, 1 reply; 24+ messages in thread
From: David Fries @ 2015-02-24  1:37 UTC (permalink / raw)
  To: Thorsten Bschorr; +Cc: linux-kernel, Evgeniy Polyakov

Thorsten,
Are you planning on submitting a patch?

FYI, I load the one wire module with
wire search_count=1
in
/etc/modules
which tells it to search the bus once only.  That works for me as
devices don't come and go on my bus, and it isn't scanning the bus
every few seconds just to find out nothing changed.  If you are
finding devices vanish even when they aren't maybe it will be useful
to you as well, though the bigger problem might be why they are
vanishing in the first place.

On Mon, Feb 23, 2015 at 06:09:16PM +0100, Thorsten Bschorr wrote:
> Hi,
> 
> I have observed a null-pointer access in w1/slaves/w1_therm on my Raspberry
> Pi running 3.18.5 with several DS18S20 temperature sensors. I have already
> discussed the problem with Evgeniy.
> 
> @Evgeniy & David: sorry for re-sending my message, my email
> accidentally contained a HTML part.
> 
> 
> 
> w1_therm: w1_slave_show checks if the sensor uses external power. If this
> is the case, the mutex on dev->bus_mutex is unlocked while waiting 750 ms
> for the sensor to convert the temperature. Before reading the temperature,
> the mutex is again locked.
> 
> During this sleep-time, the sensor could get detached, for example by w1.c:
> w1_search_process_cb not finding the sensor (*):
>  !test_bit(W1_SLAVE_ACTIVE, &sl->flags) ==true.
> This triggers w1_slave_detach, and hence w1_therm_remove_slave frees and
> nulls sl->family_data.
> 
> w1_slave_show does not check if familiy_data is null after re-acquiring the
> bus_mutex resulting in a null-ptr access when storing data.
> 
> After I've added checks for family_data!=0, I did not observe any more
> crashes; the other data of struct w1_slave seem to be valid as long as any
> thread executes w1_slave_show.
> 
> 
> I have added debug-output to w1.c and w1_therm.c, here's a log of a
> potential crash:
> 
> [184199.510227] w1_master_driver w1_bus_master2: w1_search_process_cb,
> !W1_SLAVE_ACTIVE, calling w1_slave_detach
> [184199.510276] w1_slave_driver 10-000802d9c9e4: w1_slave_detach
> destroy_now 1
> [184199.510297] w1_slave_driver 10-000802d9c9e4: w1_unref_slave refcnt 0
> [184199.510321] w1_slave_driver 10-000802d9c9e4: w1_unref_slave: detaching
> 10-000802d9c9e4 [d5fb8800].
> [184199.510347] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
> w1_family_notify
> [184199.510365] w1_slave_driver 10-000802d9c9e4: w1_family_notify calling
> remove_slave
> [184199.510382] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
> [184199.510400] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
> refcnt -1
> [184200.049745] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
> sleep), family_data==0
> [184200.137133] w1_slave_driver 10-000802d9c9e4: w1_slave_show (before
> sleep), family_data==0
> [184200.889551] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
> sleep), family_data==0
> [184200.930866] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
> sleep), family_data==0
> [184200.930907] w1_slave_driver 10-000802d9c9e4: Read failed CRC check
> [184200.931002] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
> device_unregister
> [184200.931169] w1 w1_unref_slave -> kfree
> 
> Note: When this crash happened, multiple threads were reading the sensor.
> 
> 
> I could trigger the problem several times, and each time device_unregister
> in w1_unref_slave was executed  *after*  w1_slave_show. In one case, the
> logged time-difference between the first family_data==0 message and
> device_unregister was about 8 seconds!
> I have not observed a w1_slave_show call after w1_unref_slave (as long as
> the device was not re-attached again).
> 
> >From my observation, the w1_slave data seem to be valid as long as
> w1_slave_show is executed.
> My guess is that the call of sysfs_remove_groups in w1_family_notify hits a
> mutex (I did not add debug output here).
> 
> 
> (*) On my tiny raspberry, this happens from time to time with high CPU and
> external disc load (timing and/or electrical issues); it seems that the
> sensor does not respond in time to the (periodic) search.
> 
> 
> Please email me if you need further information.
> 
> 
> 
> Best regards,
> Thorsten Bschorr
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
David Fries <david@fries.net>    PGP pub CB1EE8F0
http://fries.net/~david/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: Fwd: w1/slaves/w1_therm: null-ptr access of sl->family_data
  2015-02-24  1:37       ` David Fries
@ 2015-02-25  9:28         ` Thorsten Bschorr
  0 siblings, 0 replies; 24+ messages in thread
From: Thorsten Bschorr @ 2015-02-25  9:28 UTC (permalink / raw)
  To: David Fries; +Cc: linux-kernel, Evgeniy Polyakov

David,

I can try to prepare a patch fixing the null-ptr access.

Thanks for the hint with the search_count, I'll try it.


2015-02-24 2:37 GMT+01:00 David Fries <David@fries.net>:
> Thorsten,
> Are you planning on submitting a patch?
>
> FYI, I load the one wire module with
> wire search_count=1
> in
> /etc/modules
> which tells it to search the bus once only.  That works for me as
> devices don't come and go on my bus, and it isn't scanning the bus
> every few seconds just to find out nothing changed.  If you are
> finding devices vanish even when they aren't maybe it will be useful
> to you as well, though the bigger problem might be why they are
> vanishing in the first place.
>
> On Mon, Feb 23, 2015 at 06:09:16PM +0100, Thorsten Bschorr wrote:
>> Hi,
>>
>> I have observed a null-pointer access in w1/slaves/w1_therm on my Raspberry
>> Pi running 3.18.5 with several DS18S20 temperature sensors. I have already
>> discussed the problem with Evgeniy.
>>
>> @Evgeniy & David: sorry for re-sending my message, my email
>> accidentally contained a HTML part.
>>
>>
>>
>> w1_therm: w1_slave_show checks if the sensor uses external power. If this
>> is the case, the mutex on dev->bus_mutex is unlocked while waiting 750 ms
>> for the sensor to convert the temperature. Before reading the temperature,
>> the mutex is again locked.
>>
>> During this sleep-time, the sensor could get detached, for example by w1.c:
>> w1_search_process_cb not finding the sensor (*):
>>  !test_bit(W1_SLAVE_ACTIVE, &sl->flags) ==true.
>> This triggers w1_slave_detach, and hence w1_therm_remove_slave frees and
>> nulls sl->family_data.
>>
>> w1_slave_show does not check if familiy_data is null after re-acquiring the
>> bus_mutex resulting in a null-ptr access when storing data.
>>
>> After I've added checks for family_data!=0, I did not observe any more
>> crashes; the other data of struct w1_slave seem to be valid as long as any
>> thread executes w1_slave_show.
>>
>>
>> I have added debug-output to w1.c and w1_therm.c, here's a log of a
>> potential crash:
>>
>> [184199.510227] w1_master_driver w1_bus_master2: w1_search_process_cb,
>> !W1_SLAVE_ACTIVE, calling w1_slave_detach
>> [184199.510276] w1_slave_driver 10-000802d9c9e4: w1_slave_detach
>> destroy_now 1
>> [184199.510297] w1_slave_driver 10-000802d9c9e4: w1_unref_slave refcnt 0
>> [184199.510321] w1_slave_driver 10-000802d9c9e4: w1_unref_slave: detaching
>> 10-000802d9c9e4 [d5fb8800].
>> [184199.510347] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
>> w1_family_notify
>> [184199.510365] w1_slave_driver 10-000802d9c9e4: w1_family_notify calling
>> remove_slave
>> [184199.510382] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
>> [184199.510400] w1_slave_driver 10-000802d9c9e4: w1_therm_remove_slave
>> refcnt -1
>> [184200.049745] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
>> sleep), family_data==0
>> [184200.137133] w1_slave_driver 10-000802d9c9e4: w1_slave_show (before
>> sleep), family_data==0
>> [184200.889551] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
>> sleep), family_data==0
>> [184200.930866] w1_slave_driver 10-000802d9c9e4: w1_slave_show (after
>> sleep), family_data==0
>> [184200.930907] w1_slave_driver 10-000802d9c9e4: Read failed CRC check
>> [184200.931002] w1_slave_driver 10-000802d9c9e4: w1_unref_slave ->
>> device_unregister
>> [184200.931169] w1 w1_unref_slave -> kfree
>>
>> Note: When this crash happened, multiple threads were reading the sensor.
>>
>>
>> I could trigger the problem several times, and each time device_unregister
>> in w1_unref_slave was executed  *after*  w1_slave_show. In one case, the
>> logged time-difference between the first family_data==0 message and
>> device_unregister was about 8 seconds!
>> I have not observed a w1_slave_show call after w1_unref_slave (as long as
>> the device was not re-attached again).
>>
>> >From my observation, the w1_slave data seem to be valid as long as
>> w1_slave_show is executed.
>> My guess is that the call of sysfs_remove_groups in w1_family_notify hits a
>> mutex (I did not add debug output here).
>>
>>
>> (*) On my tiny raspberry, this happens from time to time with high CPU and
>> external disc load (timing and/or electrical issues); it seems that the
>> sensor does not respond in time to the (periodic) search.
>>
>>
>> Please email me if you need further information.
>>
>>
>>
>> Best regards,
>> Thorsten Bschorr
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at  http://www.tux.org/lkml/
>
> --
> David Fries <david@fries.net>    PGP pub CB1EE8F0
> http://fries.net/~david/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
       [not found] <CA+1k2cH5Y+RngvEbgE3CW5g+bO3V1ytDJC=u6DLVDZvbEOhu5A@mail.gmail.com>
       [not found] ` <CA+1k2cF1frcEUu-L_cSJxTp=GKExn6Vt2rdCeY=zrhM62FUggw@mail.gmail.com>
@ 2015-02-27  8:43 ` Thorsten.Bschorr
  2015-02-28 20:17   ` David Fries
  1 sibling, 1 reply; 24+ messages in thread
From: Thorsten.Bschorr @ 2015-02-27  8:43 UTC (permalink / raw)
  To: linux-kernel; +Cc: Evgeniy Polyakov, David Fries

w1_slave_show unlocks the bus while waiting for the sensor
to convert the temperature. If the sensor gets disconnected
during that time, sl->family_data could get NULL.
Note: the w1_slave pointer inside w1_slave_show is protected by
sysfs-mutex
---
 drivers/w1/slaves/w1_therm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/w1/slaves/w1_therm.c b/drivers/w1/slaves/w1_therm.c
index 1f11a20..3a14cb1 100644
--- a/drivers/w1/slaves/w1_therm.c
+++ b/drivers/w1/slaves/w1_therm.c
@@ -236,6 +236,12 @@ static ssize_t w1_slave_show(struct device *device,
 				i = mutex_lock_interruptible(&dev->bus_mutex);
 				if (i != 0)
 					return i;
+
+				/* Ensure that device is still there -
+				 * it could have gone while we unlocked the bus.
+				 * Note: sl is still protected by sysfs-mutex */
+				if(!sl->family_data)
+					return -ENODEV;
 			} else if (!w1_strong_pullup) {
 				sleep_rem = msleep_interruptible(tm);
 				if (sleep_rem != 0) {
-- 
1.8.4.5


---
Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft.
http://www.avast.com


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-02-27  8:43 ` [PATCH] Avoid null-pointer access in w1/slaves/w1_therm Thorsten.Bschorr
@ 2015-02-28 20:17   ` David Fries
       [not found]     ` <369891425174502@web4m.yandex.ru>
  0 siblings, 1 reply; 24+ messages in thread
From: David Fries @ 2015-02-28 20:17 UTC (permalink / raw)
  To: Thorsten.Bschorr; +Cc: linux-kernel, Evgeniy Polyakov

Thanks for preparing the patch, it looks like it will go another
round, but that happens to everyone.

To simplify to show a problem,

mutex_lock_interruptible(&dev->bus_mutex);

if (!sl->family_data)
	return -ENODEV;

If family_data is NULL, then dev->bus_mutex is left locked.  Your
earlier e-mails indicated that you weren't seeing any problems after
applying this patch, but I would have thought once this early return
was encountered, that bus would not be usable any longer because
bus_mutex would be locked.  I don't know if the bus controller went
away, so there is effectively a new bus, because I assume you are
still able to read the temperature once this condition triggers?

It looks like you need to add mutex_unlock(&dev->bus_mutex); before
return -ENODEV; and that's how it is handled elsewhere in the
function, but you might want to put a printk before the return,
trigger it and verify it is getting hung up on the bus_mutex, it would
also be telling if you can remove the wire and the other w1 modules,
they should not let you remove them while a mutex is held.


The patch needs to be run through the checkpatch script, and it's
listing two issues, adding a space after the if (I also prefer not
having it, but this is for consistency), and missing Signed-off-by:

That would look like this, only with your information, it's a
tracing of where the changes came from and agreeing to the copyright,
you can read more in Documentation/SubmittingPatches if you haven't
already.
Signed-off-by: David Fries <David@Fries.net>

scripts/checkpatch.pl /tmp/w1_null_patch.patch

Thanks, keep up improving things.

On Fri, Feb 27, 2015 at 09:43:14AM +0100, Thorsten.Bschorr wrote:
> w1_slave_show unlocks the bus while waiting for the sensor
> to convert the temperature. If the sensor gets disconnected
> during that time, sl->family_data could get NULL.
> Note: the w1_slave pointer inside w1_slave_show is protected by
> sysfs-mutex
> ---
>  drivers/w1/slaves/w1_therm.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/w1/slaves/w1_therm.c b/drivers/w1/slaves/w1_therm.c
> index 1f11a20..3a14cb1 100644
> --- a/drivers/w1/slaves/w1_therm.c
> +++ b/drivers/w1/slaves/w1_therm.c
> @@ -236,6 +236,12 @@ static ssize_t w1_slave_show(struct device *device,
>  				i = mutex_lock_interruptible(&dev->bus_mutex);
>  				if (i != 0)
>  					return i;
> +
> +				/* Ensure that device is still there -
> +				 * it could have gone while we unlocked the bus.
> +				 * Note: sl is still protected by sysfs-mutex */
> +				if(!sl->family_data)
> +					return -ENODEV;
>  			} else if (!w1_strong_pullup) {
>  				sleep_rem = msleep_interruptible(tm);
>  				if (sleep_rem != 0) {
> -- 
> 1.8.4.5
> 
> 
> ---
> Diese E-Mail wurde von Avast Antivirus-Software auf Viren geprüft.
> http://www.avast.com
> 

-- 
David Fries <david@fries.net>    PGP pub CB1EE8F0
http://fries.net/~david/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
       [not found]     ` <369891425174502@web4m.yandex.ru>
@ 2015-03-01  2:17       ` David Fries
  2015-03-01 13:04         ` Thorsten Bschorr
  0 siblings, 1 reply; 24+ messages in thread
From: David Fries @ 2015-03-01  2:17 UTC (permalink / raw)
  To: ??????? ???????; +Cc: Thorsten.Bschorr, linux-kernel

On Sun, Mar 01, 2015 at 04:48:22AM +0300, ??????? ??????? wrote:
> Hi everyone
> 
> 28.02.2015, 23:18, "David Fries" <David@Fries.net>:
> > Thanks for preparing the patch, it looks like it will go another
> > round, but that happens to everyone.
> 
> Patch itself doesn't solve the problem, it only adds a workaround.
> There is a reference counter issue - reading data via sysfs only holds
> a reference to device, w1 slave structure will be deleted by w1 search core.
> 
> I believe it only accidentally doesn't crash because of kernel memory allocator
> hasn't yet reclaimed memory.

Good point, it does look like that's the case.

w1_unref_slave calls w1_family_notify(BUS_NOTIFY_DEL_DEVICE, sl);
which calls sl->family->fops->remove_slave(sl);
which kfree and sl->family_data = NULL;

In that case what do you think about in w1_slave_show after getting
bus_mutex,
atomic_inc(&sl->refcnt);

then add
w1_unref_slave(sl);
in each return path, or do a goto and consolidate them.

or just increment it while sleeping, which is when it's needed, which
also looks simpler.

			if (external_power) {
+				int refcnt;
				mutex_unlock(&dev->bus_mutex);

+				/* prevent the slave from going away */
+				atomic_inc(&sl->refcnt);
				sleep_rem = msleep_interruptible(tm);
+				refcnt = w1_unref_slave(sl);
-				if (sleep_rem != 0)
+				if (sleep_rem != 0 || !refcnt)
					return -EINTR;

				i = mutex_lock_interruptible(&dev->bus_mutex);
				if (i != 0)
					return i;
			} else if (!w1_strong_pullup) {


-- 
David Fries <david@fries.net>    PGP pub CB1EE8F0
http://fries.net/~david/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-01  2:17       ` David Fries
@ 2015-03-01 13:04         ` Thorsten Bschorr
  2015-03-02  0:17           ` David Fries
  0 siblings, 1 reply; 24+ messages in thread
From: Thorsten Bschorr @ 2015-03-01 13:04 UTC (permalink / raw)
  To: David Fries; +Cc: Evgeniy Polyakov, linux-kernel

Hi David,

thanks for your feedback on my first patch, I wasn't aware of checkpatch.pl.

Initially, I had just if-ed the usage of family-data, which did not
look that nice. I was referring to this proof-of-concept workaround in
my initial bug report.

The patch I've submitted is different from my proof-of-concept
workaround. Not unlocking the bus before returning clearly is an
error, I did not extensively test this patch.


> or just increment it while sleeping, which is when it's needed, which
> also looks simpler.
>
>                         if (external_power) {
> +                               int refcnt;
>                                 mutex_unlock(&dev->bus_mutex);
>
> +                               /* prevent the slave from going away */
> +                               atomic_inc(&sl->refcnt);
>                                 sleep_rem = msleep_interruptible(tm);
> +                               refcnt = w1_unref_slave(sl);
> -                               if (sleep_rem != 0)
> +                               if (sleep_rem != 0 || !refcnt)
>                                         return -EINTR;
>
>                                 i = mutex_lock_interruptible(&dev->bus_mutex);
>                                 if (i != 0)
>                                         return i;
>                         } else if (!w1_strong_pullup) {


I like this better than my workaround-patch.

One thought occurred to me when looking at this proposal: wouldn't it
be even better to increase sl->refcnt before unlocking the mutex?
I was asking myself if it is possible that the current thread gets
suspended between mutex_unlock(&dev->bus_mutex); and
atomic_inc(&sl->refcnt); thus leaving another thread the change to
unref the device?
(I'm not that familiar with linux scheduling, so my assumption might be void.)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-01 13:04         ` Thorsten Bschorr
@ 2015-03-02  0:17           ` David Fries
  2015-03-04 15:36             ` Евгений Поляков
  0 siblings, 1 reply; 24+ messages in thread
From: David Fries @ 2015-03-02  0:17 UTC (permalink / raw)
  To: Thorsten Bschorr; +Cc: Evgeniy Polyakov, linux-kernel

On Sun, Mar 01, 2015 at 02:04:53PM +0100, Thorsten Bschorr wrote:
> Hi David,
> 
> thanks for your feedback on my first patch, I wasn't aware of checkpatch.pl.
> 
> Initially, I had just if-ed the usage of family-data, which did not
> look that nice. I was referring to this proof-of-concept workaround in
> my initial bug report.
> 
> The patch I've submitted is different from my proof-of-concept
> workaround. Not unlocking the bus before returning clearly is an
> error, I did not extensively test this patch.
> 
> 
> > or just increment it while sleeping, which is when it's needed, which
> > also looks simpler.
> >
> >                         if (external_power) {
> > +                               int refcnt;
> >                                 mutex_unlock(&dev->bus_mutex);
> >
> > +                               /* prevent the slave from going away */
> > +                               atomic_inc(&sl->refcnt);
> >                                 sleep_rem = msleep_interruptible(tm);
> > +                               refcnt = w1_unref_slave(sl);
> > -                               if (sleep_rem != 0)
> > +                               if (sleep_rem != 0 || !refcnt)
> >                                         return -EINTR;
> >
> >                                 i = mutex_lock_interruptible(&dev->bus_mutex);
> >                                 if (i != 0)
> >                                         return i;
> >                         } else if (!w1_strong_pullup) {
> 
> 
> I like this better than my workaround-patch.
> 
> One thought occurred to me when looking at this proposal: wouldn't it
> be even better to increase sl->refcnt before unlocking the mutex?
> I was asking myself if it is possible that the current thread gets
> suspended between mutex_unlock(&dev->bus_mutex); and
> atomic_inc(&sl->refcnt); thus leaving another thread the change to
> unref the device?
> (I'm not that familiar with linux scheduling, so my assumption might be void.)

You are correct, it would be a race condition if it doesn't increment
the refcnt before unlocking the mutex, and it should get the mutex
before unref.  Here's an updated version, I haven't even tried to
compile it.

What do you think Evgeniy?

			if (external_power) {
				int refcnt;
				/* prevent the slave from going away in sleep */
				atomic_inc(&sl->refcnt);
				mutex_unlock(&dev->bus_mutex);

				sleep_rem = msleep_interruptible(tm);
				if (sleep_rem != 0) {
					w1_unref_slave(sl);
					return -EINTR;
				}

				i = mutex_lock_interruptible(&dev->bus_mutex);
				refcnt = w1_unref_slave(sl);
				if (i != 0) {
					/* failed to lock */
					return i;
				}
				if (!refcnt)
					/* got lock, but slave went away */
					mutex_unlock(&dev->bus_mutex);
					return -EINTR;
				}
			} else if (!w1_strong_pullup) {


-- 
David Fries <david@fries.net>    PGP pub CB1EE8F0
http://fries.net/~david/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-02  0:17           ` David Fries
@ 2015-03-04 15:36             ` Евгений Поляков
  2015-03-08 21:14               ` David Fries
  0 siblings, 1 reply; 24+ messages in thread
From: Евгений Поляков @ 2015-03-04 15:36 UTC (permalink / raw)
  To: David Fries, Thorsten Bschorr; +Cc: linux-kernel

Hi David

02.03.2015, 03:17, "David Fries" <david@fries.net>:

> You are correct, it would be a race condition if it doesn't increment
> the refcnt before unlocking the mutex, and it should get the mutex
> before unref.  Here's an updated version, I haven't even tried to
> compile it.
>
> What do you think Evgeniy?

Sounds correct, it should do the trick, please cook up a patch, Thorsten, can you test it?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-04 15:36             ` Евгений Поляков
@ 2015-03-08 21:14               ` David Fries
  2015-03-09 22:47                 ` Thorsten Bschorr
  0 siblings, 1 reply; 24+ messages in thread
From: David Fries @ 2015-03-08 21:14 UTC (permalink / raw)
  To: ??????? ???????; +Cc: Thorsten Bschorr, Jonathan ALIBERT, linux-kernel

On Wed, Mar 04, 2015 at 06:36:41PM +0300, ??????? ??????? wrote:
> Hi David
> 
> 02.03.2015, 03:17, "David Fries" <david@fries.net>:
> 
> > You are correct, it would be a race condition if it doesn't increment
> > the refcnt before unlocking the mutex, and it should get the mutex
> > before unref.  Here's an updated version, I haven't even tried to
> > compile it.
> >
> > What do you think Evgeniy?
> 
> Sounds correct, it should do the trick, please cook up a patch, Thorsten, can you test it?


I quickly found out that strategy wasn't going to work.
w1_unref_slave calls w1_family_notify(BUS_NOTIFY_DEL_DEVICE, sl);
which calls sysfs_remove_groups which deadlocks,

[<ffffffff81047426>] ? wait_woken+0x54/0x54
[<ffffffff810e7d6e>] ? kernfs_remove_by_name_ns+0x67/0x82
[<ffffffff810e97c8>] ? remove_files+0x30/0x58
[<ffffffff810e9b8a>] ? sysfs_remove_group+0x60/0x7b
[<ffffffff810e9bc2>] ? sysfs_remove_groups+0x1d/0x2f
[<ffffffffa01951ae>] ? w1_unref_slave+0xd9/0x13b [wire]
[<ffffffffa01a516e>] ? w1_slave_show+0x11a/0x335 [w1_therm]

Here's a different strategy, add a w1_therm family_data specific mutex
so w1_therm_remove_slave won't return while another thread is in
w1_slave_show.  Thoughts?

I included three patches, the first is the proposed fix, the second
makes it more reproducible (and since my testing doesn't have external
power I had to ignore that check), the third is just some debugging
messages.  For testing I'm starting a read from w1_slave then manually
remove the sensor, which seems to do the trick.

echo 28-[ds18b20_id] > /sys/devices/w1_bus_master1/w1_master_remove

Give it a try and let me know how it goes.

My test system host is running 3.16, I'm using kvm to test 3.19 giving
it access to the USB one wire dongle, and there's some refcnt problem
I've not been able to track down.  Once and a while when I have the
kvm grab the USB device the host will start printing a refcount
problem, which takes a reboot to clear, which is annoying because the
reason to test in kvm is to not break the host.

w1_master_driver w1_bus_master1: Waiting for w1_bus_master1 to become
free: refcnt=1.

>From bcde1a8d3d00cc3b13dff7e2476f8c03efc59001 Mon Sep 17 00:00:00 2001
From: David Fries <David@Fries.net>
Date: Sat, 7 Mar 2015 22:25:37 -0600
Subject: [PATCH 1/3] w1_therm, don't let the slave go away while in
 w1_slave_show

A temperature conversion can take 750 ms to complete, if the sensor is
externally powered it releases the bus_mutex while it waits, but if
the slave device is removed sl becomes a dangling pointer.
The race condition window can be 10 * 750 ms = 7.5 seconds if the crc
check fails.

Reported-By: Thorsten Bschorr <thorsten@bschorr.de>
Signed-off-by: David Fries <David@Fries.net>
---
 drivers/w1/slaves/w1_therm.c |   52 ++++++++++++++++++++++++++++++------------
 1 file changed, 38 insertions(+), 14 deletions(-)

diff --git a/drivers/w1/slaves/w1_therm.c b/drivers/w1/slaves/w1_therm.c
index 1f11a20..6b3ef93 100644
--- a/drivers/w1/slaves/w1_therm.c
+++ b/drivers/w1/slaves/w1_therm.c
@@ -59,9 +59,20 @@ MODULE_ALIAS("w1-family-" __stringify(W1_THERM_DS28EA00));
 static int w1_strong_pullup = 1;
 module_param_named(strong_pullup, w1_strong_pullup, int, 0);
 
+struct w1_therm_family_data {
+	uint8_t rom[9];
+	struct mutex lock;
+};
+
+/* return the address of the lock in the family data */
+#define THERM_LOCK(sl) \
+	(&((struct w1_therm_family_data*)sl->family_data)->lock)
+
 static int w1_therm_add_slave(struct w1_slave *sl)
 {
-	sl->family_data = kzalloc(9, GFP_KERNEL);
+	sl->family_data = kzalloc(sizeof(struct w1_therm_family_data),
+		GFP_KERNEL);
+	mutex_init(THERM_LOCK(sl));
 	if (!sl->family_data)
 		return -ENOMEM;
 	return 0;
@@ -69,6 +80,11 @@ static int w1_therm_add_slave(struct w1_slave *sl)
 
 static void w1_therm_remove_slave(struct w1_slave *sl)
 {
+	/* Getting the lock means w1_slave_show isn't sleeping and the
+	 * family_data can be freed.
+	 */
+	mutex_lock(THERM_LOCK(sl));
+	mutex_unlock(THERM_LOCK(sl));
 	kfree(sl->family_data);
 	sl->family_data = NULL;
 }
@@ -194,13 +210,15 @@ static ssize_t w1_slave_show(struct device *device,
 	struct w1_slave *sl = dev_to_w1_slave(device);
 	struct w1_master *dev = sl->master;
 	u8 rom[9], crc, verdict, external_power;
-	int i, max_trying = 10;
+	int i, ret, max_trying = 10;
 	ssize_t c = PAGE_SIZE;
 
-	i = mutex_lock_interruptible(&dev->bus_mutex);
-	if (i != 0)
-		return i;
+	ret = mutex_lock_interruptible(&dev->bus_mutex);
+	if (ret != 0)
+		goto post_unlock;
 
+	/* prevent the slave from going away in sleep */
+	mutex_lock(THERM_LOCK(sl));
 	memset(rom, 0, sizeof(rom));
 
 	while (max_trying--) {
@@ -229,18 +247,19 @@ static ssize_t w1_slave_show(struct device *device,
 			if (external_power) {
 				mutex_unlock(&dev->bus_mutex);
 
-				sleep_rem = msleep_interruptible(tm);
-				if (sleep_rem != 0)
-					return -EINTR;
+				if (sleep_rem != 0) {
+					ret = -EINTR;
+					goto post_unlock;
+				}
 
-				i = mutex_lock_interruptible(&dev->bus_mutex);
-				if (i != 0)
-					return i;
+				ret = mutex_lock_interruptible(&dev->bus_mutex);
+				if (ret != 0)
+					goto post_unlock;
 			} else if (!w1_strong_pullup) {
 				sleep_rem = msleep_interruptible(tm);
 				if (sleep_rem != 0) {
-					mutex_unlock(&dev->bus_mutex);
-					return -EINTR;
+					ret = -EINTR;
+					goto pre_unlock;
 				}
 			}
 
@@ -279,9 +298,14 @@ static ssize_t w1_slave_show(struct device *device,
 
 	c -= snprintf(buf + PAGE_SIZE - c, c, "t=%d\n",
 		w1_convert_temp(rom, sl->family->fid));
+	ret = PAGE_SIZE - c;
+
+pre_unlock:
 	mutex_unlock(&dev->bus_mutex);
 
-	return PAGE_SIZE - c;
+post_unlock:
+	mutex_unlock(THERM_LOCK(sl));
+	return ret;
 }
 
 static int __init w1_therm_init(void)
-- 
1.7.10.4


>From bb3686e0d1c38798ce922e2565ac094757f02f9d Mon Sep 17 00:00:00 2001
From: David Fries <David@Fries.net>
Date: Sat, 7 Mar 2015 19:53:54 -0600
Subject: [PATCH 2/3] increase w1_therm race condition window

Also disable the check for external power as the current setup
doesn't have it.
---
 drivers/w1/slaves/w1_therm.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/w1/slaves/w1_therm.c b/drivers/w1/slaves/w1_therm.c
index 6b3ef93..cb46e85 100644
--- a/drivers/w1/slaves/w1_therm.c
+++ b/drivers/w1/slaves/w1_therm.c
@@ -244,9 +244,11 @@ static ssize_t w1_slave_show(struct device *device,
 
 			w1_write_8(dev, W1_CONVERT_TEMP);
 
-			if (external_power) {
+			/*if (external_power)*/ if(true) {
 				mutex_unlock(&dev->bus_mutex);
 
+				//sleep_rem = msleep_interruptible(tm);
+				sleep_rem = msleep_interruptible(10*1000);
 				if (sleep_rem != 0) {
 					ret = -EINTR;
 					goto post_unlock;
-- 
1.7.10.4


>From 6eff9cbdacb36f0e960b1117b48584af18160443 Mon Sep 17 00:00:00 2001
From: David Fries <David@Fries.net>
Date: Sun, 8 Mar 2015 14:01:28 -0500
Subject: [PATCH 3/3] debugging w1_therm race condition

---
 drivers/w1/slaves/w1_therm.c |    4 ++++
 drivers/w1/w1.c              |    4 ++++
 2 files changed, 8 insertions(+)

diff --git a/drivers/w1/slaves/w1_therm.c b/drivers/w1/slaves/w1_therm.c
index cb46e85..cb0fe1b 100644
--- a/drivers/w1/slaves/w1_therm.c
+++ b/drivers/w1/slaves/w1_therm.c
@@ -80,11 +80,13 @@ static int w1_therm_add_slave(struct w1_slave *sl)
 
 static void w1_therm_remove_slave(struct w1_slave *sl)
 {
+	printk(KERN_NOTICE "%s about to lock faily_data->lock\n", __func__);
 	/* Getting the lock means w1_slave_show isn't sleeping and the
 	 * family_data can be freed.
 	 */
 	mutex_lock(THERM_LOCK(sl));
 	mutex_unlock(THERM_LOCK(sl));
+	printk(KERN_NOTICE "%s unlocked faily_data->lock\n", __func__);
 	kfree(sl->family_data);
 	sl->family_data = NULL;
 }
@@ -248,7 +250,9 @@ static ssize_t w1_slave_show(struct device *device,
 				mutex_unlock(&dev->bus_mutex);
 
 				//sleep_rem = msleep_interruptible(tm);
+				printk(KERN_NOTICE "start sleep %p refcnt %u\n", sl, atomic_read(&sl->refcnt));
 				sleep_rem = msleep_interruptible(10*1000);
+				printk(KERN_NOTICE "end sleep %p refcnt %u\n", sl, atomic_read(&sl->refcnt));
 				if (sleep_rem != 0) {
 					ret = -EINTR;
 					goto post_unlock;
diff --git a/drivers/w1/w1.c b/drivers/w1/w1.c
index 181f41c..9bb3116 100644
--- a/drivers/w1/w1.c
+++ b/drivers/w1/w1.c
@@ -650,8 +650,10 @@ static int w1_family_notify(unsigned long action, struct w1_slave *sl)
 	case BUS_NOTIFY_DEL_DEVICE:
 		if (fops->remove_slave)
 			sl->family->fops->remove_slave(sl);
+		printk(KERN_NOTICE "calling sysfs_remove_groups\n");
 		if (fops->groups)
 			sysfs_remove_groups(&sl->dev.kobj, fops->groups);
+		printk(KERN_NOTICE "returned sysfs_remove_groups\n");
 		break;
 	}
 	return 0;
@@ -769,6 +771,7 @@ int w1_unref_slave(struct w1_slave *sl)
 	int refcnt;
 	mutex_lock(&dev->list_mutex);
 	refcnt = atomic_sub_return(1, &sl->refcnt);
+	printk(KERN_NOTICE "%s slave %p refcnt now %u\n", __func__, sl, refcnt);
 	if (refcnt == 0) {
 		struct w1_netlink_msg msg;
 
@@ -788,6 +791,7 @@ int w1_unref_slave(struct w1_slave *sl)
 		memset(sl, 0, sizeof(*sl));
 		#endif
 		kfree(sl);
+		printk(KERN_NOTICE "%s w1_slave %p\n", __func__, sl);
 	}
 	atomic_dec(&dev->refcnt);
 	mutex_unlock(&dev->list_mutex);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-08 21:14               ` David Fries
@ 2015-03-09 22:47                 ` Thorsten Bschorr
  2015-03-09 23:09                   ` David Fries
  0 siblings, 1 reply; 24+ messages in thread
From: Thorsten Bschorr @ 2015-03-09 22:47 UTC (permalink / raw)
  To: David Fries; +Cc: ??????? ???????, Jonathan ALIBERT, linux-kernel

> Here's a different strategy, add a w1_therm family_data specific mutex
> so w1_therm_remove_slave won't return while another thread is in
> w1_slave_show.  Thoughts?
>
> I included three patches, the first is the proposed fix, the second
> makes it more reproducible (and since my testing doesn't have external
> power I had to ignore that check), the third is just some debugging
> messages.  For testing I'm starting a read from w1_slave then manually
> remove the sensor, which seems to do the trick.

I'll test your patch, but keeping the original sleep-time tm.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-09 22:47                 ` Thorsten Bschorr
@ 2015-03-09 23:09                   ` David Fries
  2015-03-10  0:05                     ` Thorsten Bschorr
  2015-03-10 13:52                     ` Evgeniy Polyakov
  0 siblings, 2 replies; 24+ messages in thread
From: David Fries @ 2015-03-09 23:09 UTC (permalink / raw)
  To: Thorsten Bschorr; +Cc: ??????? ???????, Jonathan ALIBERT, linux-kernel

On Mon, Mar 09, 2015 at 11:47:24PM +0100, Thorsten Bschorr wrote:
> > Here's a different strategy, add a w1_therm family_data specific mutex
> > so w1_therm_remove_slave won't return while another thread is in
> > w1_slave_show.  Thoughts?
> >
> > I included three patches, the first is the proposed fix, the second
> > makes it more reproducible (and since my testing doesn't have external
> > power I had to ignore that check), the third is just some debugging
> > messages.  For testing I'm starting a read from w1_slave then manually
> > remove the sensor, which seems to do the trick.
> 
> I'll test your patch, but keeping the original sleep-time tm.

Oops, sorry, it got lost in the shuffle, here's the first patch again
(the others were for debugging and increase that time and so wouldn't
go upstream anyway).

>From 777f5fd75f5f99a3352863e83d226c7b65ebdaa4 Mon Sep 17 00:00:00 2001
From: David Fries <David@Fries.net>
Date: Sat, 7 Mar 2015 22:25:37 -0600
Subject: [PATCH] w1_therm, don't let the slave go away while in w1_slave_show

A temperature conversion can take 750 ms to complete, if the sensor is
externally powered it releases the bus_mutex while it waits, but if
the slave device is removed sl becomes a dangling pointer.
The race condition window can be 10 * 750 ms = 7.5 seconds if the crc
check fails.

Reported-By: Thorsten Bschorr <thorsten@bschorr.de>
Signed-off-by: David Fries <David@Fries.net>
---
 drivers/w1/slaves/w1_therm.c | 51 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/drivers/w1/slaves/w1_therm.c b/drivers/w1/slaves/w1_therm.c
index 1f11a20..39a9e6a 100644
--- a/drivers/w1/slaves/w1_therm.c
+++ b/drivers/w1/slaves/w1_therm.c
@@ -59,9 +59,20 @@ MODULE_ALIAS("w1-family-" __stringify(W1_THERM_DS28EA00));
 static int w1_strong_pullup = 1;
 module_param_named(strong_pullup, w1_strong_pullup, int, 0);
 
+struct w1_therm_family_data {
+	uint8_t rom[9];
+	struct mutex lock;
+};
+
+/* return the address of the lock in the family data */
+#define THERM_LOCK(sl) \
+	(&((struct w1_therm_family_data*)sl->family_data)->lock)
+
 static int w1_therm_add_slave(struct w1_slave *sl)
 {
-	sl->family_data = kzalloc(9, GFP_KERNEL);
+	sl->family_data = kzalloc(sizeof(struct w1_therm_family_data),
+		GFP_KERNEL);
+	mutex_init(THERM_LOCK(sl));
 	if (!sl->family_data)
 		return -ENOMEM;
 	return 0;
@@ -69,6 +80,11 @@ static int w1_therm_add_slave(struct w1_slave *sl)
 
 static void w1_therm_remove_slave(struct w1_slave *sl)
 {
+	/* Getting the lock means w1_slave_show isn't sleeping and the
+	 * family_data can be freed.
+	 */
+	mutex_lock(THERM_LOCK(sl));
+	mutex_unlock(THERM_LOCK(sl));
 	kfree(sl->family_data);
 	sl->family_data = NULL;
 }
@@ -194,13 +210,15 @@ static ssize_t w1_slave_show(struct device *device,
 	struct w1_slave *sl = dev_to_w1_slave(device);
 	struct w1_master *dev = sl->master;
 	u8 rom[9], crc, verdict, external_power;
-	int i, max_trying = 10;
+	int i, ret, max_trying = 10;
 	ssize_t c = PAGE_SIZE;
 
-	i = mutex_lock_interruptible(&dev->bus_mutex);
-	if (i != 0)
-		return i;
+	ret = mutex_lock_interruptible(&dev->bus_mutex);
+	if (ret != 0)
+		goto post_unlock;
 
+	/* prevent the slave from going away in sleep */
+	mutex_lock(THERM_LOCK(sl));
 	memset(rom, 0, sizeof(rom));
 
 	while (max_trying--) {
@@ -230,17 +248,19 @@ static ssize_t w1_slave_show(struct device *device,
 				mutex_unlock(&dev->bus_mutex);
 
 				sleep_rem = msleep_interruptible(tm);
-				if (sleep_rem != 0)
-					return -EINTR;
+				if (sleep_rem != 0) {
+					ret = -EINTR;
+					goto post_unlock;
+				}
 
-				i = mutex_lock_interruptible(&dev->bus_mutex);
-				if (i != 0)
-					return i;
+				ret = mutex_lock_interruptible(&dev->bus_mutex);
+				if (ret != 0)
+					goto post_unlock;
 			} else if (!w1_strong_pullup) {
 				sleep_rem = msleep_interruptible(tm);
 				if (sleep_rem != 0) {
-					mutex_unlock(&dev->bus_mutex);
-					return -EINTR;
+					ret = -EINTR;
+					goto pre_unlock;
 				}
 			}
 
@@ -279,9 +299,14 @@ static ssize_t w1_slave_show(struct device *device,
 
 	c -= snprintf(buf + PAGE_SIZE - c, c, "t=%d\n",
 		w1_convert_temp(rom, sl->family->fid));
+	ret = PAGE_SIZE - c;
+
+pre_unlock:
 	mutex_unlock(&dev->bus_mutex);
 
-	return PAGE_SIZE - c;
+post_unlock:
+	mutex_unlock(THERM_LOCK(sl));
+	return ret;
 }
 
 static int __init w1_therm_init(void)
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-09 23:09                   ` David Fries
@ 2015-03-10  0:05                     ` Thorsten Bschorr
  2015-03-10  0:34                       ` Thorsten Bschorr
  2015-03-10 13:52                     ` Evgeniy Polyakov
  1 sibling, 1 reply; 24+ messages in thread
From: Thorsten Bschorr @ 2015-03-10  0:05 UTC (permalink / raw)
  To: David Fries; +Cc: ??????? ???????, Jonathan ALIBERT, linux-kernel

> Oops, sorry, it got lost in the shuffle, here's the first patch again
> (the others were for debugging and increase that time and so wouldn't
> go upstream anyway).

I assumed so.


It looks like your patch runs into dead locks problems:

I have a cron job reading my sensors. If I read the sensors on another
thread (e.g. via cat), the 2nd thread can produce a dead lock:

* thread 1 has bus & sl lock
* thread 1 drops bus lock, keeps sl locks and sleeps
* thread 2 get bus lock, waits for sl lock
* thread 1 returns from sleep, waits for bus lock, but this is help by thread 2


After a couple of seconds, I get the following dump (on 3.18.9):

Mar 10 00:29:33 pi kernel: [  481.184239] cat             D c0546c48
  0  2523   2422 0x00000000
Mar 10 00:29:33 pi kernel: [  481.184320] [<c0546c48>] (__schedule)
from [<c0547130>] (schedule+0x40/0x8c)
Mar 10 00:29:33 pi kernel: [  481.184357] [<c0547130>] (schedule) from
[<c05473f0>] (schedule_preempt_disabled+0x30/0x40)
Mar 10 00:29:33 pi kernel: [  481.184393] [<c05473f0>]
(schedule_preempt_disabled) from [<c0548dc8>]
(__mutex_lock_slowpath+0xa8/0x174)
Mar 10 00:29:33 pi kernel: [  481.184424] [<c0548dc8>]
(__mutex_lock_slowpath) from [<c0548ecc>] (mutex_lock+0x38/0x3c)
Mar 10 00:29:33 pi kernel: [  481.184465] [<c0548ecc>] (mutex_lock)
from [<bf0220f4>] (w1_slave_show+0x60/0x3ec [w1_therm])
Mar 10 00:29:33 pi kernel: [  481.184517] [<bf0220f4>] (w1_slave_show
[w1_therm]) from [<c03565bc>] (dev_attr_show+0x2c/0x58)
Mar 10 00:29:33 pi kernel: [  481.184558] [<c03565bc>] (dev_attr_show)
from [<c01a5f54>] (sysfs_kf_seq_show+0x9c/0x120)
Mar 10 00:29:33 pi kernel: [  481.184589] [<c01a5f54>]
(sysfs_kf_seq_show) from [<c01a479c>] (kernfs_seq_show+0x34/0x38)
Mar 10 00:29:33 pi kernel: [  481.184624] [<c01a479c>]
(kernfs_seq_show) from [<c015905c>] (seq_read+0x1ac/0x4c4)
Mar 10 00:29:33 pi kernel: [  481.184653] [<c015905c>] (seq_read) from
[<c01a52d4>] (kernfs_fop_read+0x11c/0x164)
Mar 10 00:29:33 pi kernel: [  481.184691] [<c01a52d4>]
(kernfs_fop_read) from [<c0135a98>] (vfs_read+0x98/0x18c)
Mar 10 00:29:33 pi kernel: [  481.184721] [<c0135a98>] (vfs_read) from
[<c0135bd8>] (SyS_read+0x4c/0x98)
Mar 10 00:29:33 pi kernel: [  481.184758] [<c0135bd8>] (SyS_read) from
[<c000eb40>] (ret_fast_syscall+0x0/0x48)


In principle, w1_slave_show  only needs a read-lock on the sl data,
and only w1_therm_remove_slave needs a write-lock. This would allow
multiple concurrent temperature readings (as before).


BTW: in w1_therm_add_slave the mutex_init call should be after if
(!sl->family_data), otherwise one might get another null pointer
issue.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-10  0:05                     ` Thorsten Bschorr
@ 2015-03-10  0:34                       ` Thorsten Bschorr
  2015-03-12  0:44                         ` David Fries
  0 siblings, 1 reply; 24+ messages in thread
From: Thorsten Bschorr @ 2015-03-10  0:34 UTC (permalink / raw)
  To: David Fries; +Cc: ??????? ???????, Jonathan ALIBERT, linux-kernel

> It looks like your patch runs into dead locks problems:
>
> I have a cron job reading my sensors. If I read the sensors on another
> thread (e.g. via cat), the 2nd thread can produce a dead lock:
>
> * thread 1 has bus & sl lock
> * thread 1 drops bus lock, keeps sl locks and sleeps
> * thread 2 get bus lock, waits for sl lock
> * thread 1 returns from sleep, waits for bus lock, but this is help by thread 2


Aquiring  sl lock before the bus lock prevents this dead lock (no
change of locking-order).
With search enabled, removing a device by w1_search_process_cb then
also worked as intended:

Mar 10 01:29:04 pi kernel: [  924.870893] w1_therm_remove_slave about
to lock faily_data->lock
Mar 10 01:29:04 pi kernel: [  924.870935] w1_therm_remove_slave
unlocked faily_data->lock

Mar 10 01:29:04 pi kernel: [  924.871115] w1_therm_remove_slave about
to lock faily_data->lock
Mar 10 01:29:05 pi kernel: [  925.151242] start sleep d117a600 refcnt 1 **
Mar 10 01:29:05 pi kernel: [  925.151277] end sleep d117a600 refcnt 0 **
Mar 10 01:29:11 pi kernel: [  931.295344] w1_slave_driver
10-000802cc045a: Read failed CRC check
Mar 10 01:29:11 pi kernel: [  931.295437] w1_therm_remove_slave
unlocked faily_data->lock
** I get the ref-cnt before and after the sleep and only log if they differ.


But I am unable to judge if mobing the sl lock at the beginning of
w1_slave_show can cause dead locks in other scenarios.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-09 23:09                   ` David Fries
  2015-03-10  0:05                     ` Thorsten Bschorr
@ 2015-03-10 13:52                     ` Evgeniy Polyakov
  2015-03-12  0:54                       ` David Fries
  1 sibling, 1 reply; 24+ messages in thread
From: Evgeniy Polyakov @ 2015-03-10 13:52 UTC (permalink / raw)
  To: David Fries, Thorsten Bschorr; +Cc: Jonathan ALIBERT, linux-kernel

Hi

10.03.2015, 02:09, "David Fries" <david@fries.net>:

> diff --git a/drivers/w1/slaves/w1_therm.c b/drivers/w1/slaves/w1_therm.c
> index 1f11a20..39a9e6a 100644
> --- a/drivers/w1/slaves/w1_therm.c
> +++ b/drivers/w1/slaves/w1_therm.c
> @@ -59,9 +59,20 @@ MODULE_ALIAS("w1-family-" __stringify(W1_THERM_DS28EA00));
>  static int w1_strong_pullup = 1;
>  module_param_named(strong_pullup, w1_strong_pullup, int, 0);
>
> +struct w1_therm_family_data {
> + uint8_t rom[9];
> + struct mutex lock;
> +};

This approach will not scale to other w1 families, I would rather prefer solutions on w1 level,
not in particular drivers. What if we drop slave reference counter at all in favor of automatic sysfs device management?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-10  0:34                       ` Thorsten Bschorr
@ 2015-03-12  0:44                         ` David Fries
  0 siblings, 0 replies; 24+ messages in thread
From: David Fries @ 2015-03-12  0:44 UTC (permalink / raw)
  To: Thorsten Bschorr; +Cc: ??????? ???????, Jonathan ALIBERT, linux-kernel

On Tue, Mar 10, 2015 at 01:34:14AM +0100, Thorsten Bschorr wrote:
> > It looks like your patch runs into dead locks problems:

I was testing reading and removing the slave, I didn't test two
readers, I'll keep that in mind.

For some reason I was thinking it was a try lock after the sleep, it
isn't, a signal can cause it to return a value, but yes that's not
going to work.  I have another version that uses an atomic counter in
place of this mutex, it requires a loop in the remove slave.

I have a one wire network with 14 temperature sensors on it.  What I
do is I have a program that talks to the kernel over netlink, which is
now non-blocking for one wire after an earlier set of changes I made.
It sends out all the conversion request requests, waits, sends the
read request packets, then gets back the results, and relays the
results to the program that requested them.  It doesn't use the
w1_therm and avoids all these problems.  If I read them sequentially
with w1_therm, that would take more than 14*.750 or 10.5 seconds,
there's no way with w1_therm to read them at once without having 14
threads or programs doing so, the problem is it takes a lot more
programming to do netlink.

> > I have a cron job reading my sensors. If I read the sensors on another
> > thread (e.g. via cat), the 2nd thread can produce a dead lock:
> >
> > * thread 1 has bus & sl lock
> > * thread 1 drops bus lock, keeps sl locks and sleeps
> > * thread 2 get bus lock, waits for sl lock
> > * thread 1 returns from sleep, waits for bus lock, but this is help by thread 2
> 
> 
> Aquiring  sl lock before the bus lock prevents this dead lock (no
> change of locking-order).
> With search enabled, removing a device by w1_search_process_cb then
> also worked as intended:
> 
> Mar 10 01:29:04 pi kernel: [  924.870893] w1_therm_remove_slave about
> to lock faily_data->lock
> Mar 10 01:29:04 pi kernel: [  924.870935] w1_therm_remove_slave
> unlocked faily_data->lock
> 
> Mar 10 01:29:04 pi kernel: [  924.871115] w1_therm_remove_slave about
> to lock faily_data->lock
> Mar 10 01:29:05 pi kernel: [  925.151242] start sleep d117a600 refcnt 1 **
> Mar 10 01:29:05 pi kernel: [  925.151277] end sleep d117a600 refcnt 0 **
> Mar 10 01:29:11 pi kernel: [  931.295344] w1_slave_driver
> 10-000802cc045a: Read failed CRC check
> Mar 10 01:29:11 pi kernel: [  931.295437] w1_therm_remove_slave
> unlocked faily_data->lock
> ** I get the ref-cnt before and after the sleep and only log if they differ.
> 
> 
> But I am unable to judge if mobing the sl lock at the beginning of
> w1_slave_show can cause dead locks in other scenarios.

I'm not sure, I would probably switch back to the referencing counting
version I wrote earlier, or make the bus mutex lock a timed lock or
try lock first.

-- 
David Fries <david@fries.net>    PGP pub CB1EE8F0
http://fries.net/~david/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-10 13:52                     ` Evgeniy Polyakov
@ 2015-03-12  0:54                       ` David Fries
  2015-03-14 20:55                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 24+ messages in thread
From: David Fries @ 2015-03-12  0:54 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Thorsten Bschorr, Jonathan ALIBERT, linux-kernel

On Tue, Mar 10, 2015 at 04:52:00PM +0300, Evgeniy Polyakov wrote:
> Hi
> 
> 10.03.2015, 02:09, "David Fries" <david@fries.net>:
> 
> > diff --git a/drivers/w1/slaves/w1_therm.c b/drivers/w1/slaves/w1_therm.c
> > index 1f11a20..39a9e6a 100644
> > --- a/drivers/w1/slaves/w1_therm.c
> > +++ b/drivers/w1/slaves/w1_therm.c
> > @@ -59,9 +59,20 @@ MODULE_ALIAS("w1-family-" __stringify(W1_THERM_DS28EA00));
> >  static int w1_strong_pullup = 1;
> >  module_param_named(strong_pullup, w1_strong_pullup, int, 0);
> >
> > +struct w1_therm_family_data {
> > + uint8_t rom[9];
> > + struct mutex lock;
> > +};
> 
> This approach will not scale to other w1 families, I would rather prefer solutions on w1 level,
> not in particular drivers. What if we drop slave reference counter at all in favor of automatic sysfs device management?

I looked and didn't see any of the other slaves dropping the lock and
being in this situation, but that doesn't mean they won't in the
future.  Personally I'm just using netlink and don't plan on using any
of the slave drivers.

Would that be removing all four refcnt, w1_slave, w1_master,
w1_family, w1_cb_block, or just some of them?  It sounds good to me,
if that had bugs there would be much more than just the w1 system
relying on it.  I don't know enough about that system or have the time
to code up that change.

I can take another look at and post the reference counting w1_therm
fix instead of the mutex version as a near term work around until that
is available if you want.

-- 
David Fries <david@fries.net>    PGP pub CB1EE8F0
http://fries.net/~david/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-12  0:54                       ` David Fries
@ 2015-03-14 20:55                         ` Evgeniy Polyakov
  2015-03-18  4:20                           ` David Fries
  0 siblings, 1 reply; 24+ messages in thread
From: Evgeniy Polyakov @ 2015-03-14 20:55 UTC (permalink / raw)
  To: David Fries; +Cc: Thorsten Bschorr, Jonathan ALIBERT, linux-kernel

Hi David

12.03.2015, 03:54, "David Fries" <david@fries.net>:
> Would that be removing all four refcnt, w1_slave, w1_master,
> w1_family, w1_cb_block, or just some of them?  It sounds good to me,
> if that had bugs there would be much more than just the w1 system
> relying on it.  I don't know enough about that system or have the time
> to code up that change.
>
> I can take another look at and post the reference counting w1_therm
> fix instead of the mutex version as a near term work around until that
> is available if you want.

Please cook up a quick fix for this problem - this bug really hurts people.
And then we will discuss how 'ideal' life cycle should look

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-14 20:55                         ` Evgeniy Polyakov
@ 2015-03-18  4:20                           ` David Fries
  2015-03-18 15:18                             ` Evgeniy Polyakov
  0 siblings, 1 reply; 24+ messages in thread
From: David Fries @ 2015-03-18  4:20 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Thorsten Bschorr, Jonathan ALIBERT, linux-kernel

On Sat, Mar 14, 2015 at 11:55:16PM +0300, Evgeniy Polyakov wrote:
> Hi David
> 
> 12.03.2015, 03:54, "David Fries" <david@fries.net>:
> > Would that be removing all four refcnt, w1_slave, w1_master,
> > w1_family, w1_cb_block, or just some of them?  It sounds good to me,
> > if that had bugs there would be much more than just the w1 system
> > relying on it.  I don't know enough about that system or have the time
> > to code up that change.
> >
> > I can take another look at and post the reference counting w1_therm
> > fix instead of the mutex version as a near term work around until that
> > is available if you want.
> 
> Please cook up a quick fix for this problem - this bug really hurts people.
> And then we will discuss how 'ideal' life cycle should look

Done, I don't like it, I'm not sure anyone else will either, but I'm
no longer crashing in testing, so that's an improvement.  My
"production" system doesn't use w1_therm, so I only see these bugs in
development testing it.  I've come to the conclusion that in the face
of a slave vanishing, w1_therm can't avoid all the race conditions, so
the real fix must be elsewhere.

>From 51d4024ca667c8b712de462489d125a78e85aa57 Mon Sep 17 00:00:00 2001
From: David Fries <David@Fries.net>
Date: Sat, 7 Mar 2015 22:25:37 -0600
Subject: [PATCH] w1_therm, reduce race conditions in w1_slave_show

After applying this patch commands such as the following in one
process,

slave=28-000002c95fb1
while true; do echo $slave > /sys/devices/w1_bus_master1/w1_master_add; sleep .1; echo $slave > /sys/devices/w1_bus_master1/w1_master_remove; sleep .1; done

and then two at the same time in two other processes,
slave=28-000002c95fb1
while true; do time cat /sys/devices/w1_bus_master1/$slave/w1_slave ; sleep .1; done

then randomly stop all three and repeat.

With this patch I no longer see crashes, but at best this patch
effectively hiding the result of a race condition.  sl->family_data is
being freed and set to NULL in the slave removal while the
w1_slave_show is then dereferencing it, this holds on to the pointer
meaning it's probably clobbering memory now instead of crashing.  I
wonder if that would make RCU be a fit for this?  The original bug
report was pointing the problem as unlocking bus_mutex while waiting
for the temperature conversion, but I was getting sl->family_data set
to NULL more reliable without external power which means bux_mutex was
held for the duration of w1_slave_show, which is not to say that the
original bug report wasn't correct, it is to say that even with the
spinlock, holding bus_mutex on the slave, isn't sufficient to keep the
slave from being removed.

Reported-By: Thorsten Bschorr <thorsten@bschorr.de>
---
 drivers/w1/slaves/w1_therm.c |   70 +++++++++++++++++++++++++++++++++---------
 1 file changed, 55 insertions(+), 15 deletions(-)

diff --git a/drivers/w1/slaves/w1_therm.c b/drivers/w1/slaves/w1_therm.c
index 1f11a20..403285d 100644
--- a/drivers/w1/slaves/w1_therm.c
+++ b/drivers/w1/slaves/w1_therm.c
@@ -59,16 +59,32 @@ MODULE_ALIAS("w1-family-" __stringify(W1_THERM_DS28EA00));
 static int w1_strong_pullup = 1;
 module_param_named(strong_pullup, w1_strong_pullup, int, 0);
 
+struct w1_therm_family_data {
+	uint8_t rom[9];
+	atomic_t refcnt;
+};
+
+/* return the address of the refcnt in the family data */
+#define THERM_REFCNT(family_data) \
+	(&((struct w1_therm_family_data*)family_data)->refcnt)
+
 static int w1_therm_add_slave(struct w1_slave *sl)
 {
-	sl->family_data = kzalloc(9, GFP_KERNEL);
+	sl->family_data = kzalloc(sizeof(struct w1_therm_family_data),
+		GFP_KERNEL);
 	if (!sl->family_data)
 		return -ENOMEM;
+	atomic_set(THERM_REFCNT(sl->family_data), 1);
 	return 0;
 }
 
 static void w1_therm_remove_slave(struct w1_slave *sl)
 {
+	int refcnt = atomic_sub_return(1, THERM_REFCNT(sl->family_data));
+	while(refcnt) {
+		msleep(1000);
+		refcnt = atomic_read(THERM_REFCNT(sl->family_data));
+	}
 	kfree(sl->family_data);
 	sl->family_data = NULL;
 }
@@ -194,13 +210,30 @@ static ssize_t w1_slave_show(struct device *device,
 	struct w1_slave *sl = dev_to_w1_slave(device);
 	struct w1_master *dev = sl->master;
 	u8 rom[9], crc, verdict, external_power;
-	int i, max_trying = 10;
+	int i, ret, max_trying = 10;
 	ssize_t c = PAGE_SIZE;
+	u8 *family_data = sl->family_data;
+
+	ret = mutex_lock_interruptible(&dev->bus_mutex);
+	if (ret != 0)
+		goto post_unlock;
 
-	i = mutex_lock_interruptible(&dev->bus_mutex);
-	if (i != 0)
-		return i;
+	if(!sl->family_data)
+	{
+		ret = -ENODEV;
+		/* Note for anyoe who actually saw this message, it is a known
+		 * problem with either slave drivers or this driver in
+		 * particular and the request is only a canary indication as
+		 * to how many people and how often it is being ran into.
+		 */
+		printk(KERN_NOTICE
+			"%s: %u sl->family_data is NULL please report\n",
+			__FILE__, __LINE__);
+		goto pre_unlock;
+	}
 
+	/* prevent the slave from going away in sleep */
+	atomic_inc(THERM_REFCNT(family_data));
 	memset(rom, 0, sizeof(rom));
 
 	while (max_trying--) {
@@ -230,17 +263,19 @@ static ssize_t w1_slave_show(struct device *device,
 				mutex_unlock(&dev->bus_mutex);
 
 				sleep_rem = msleep_interruptible(tm);
-				if (sleep_rem != 0)
-					return -EINTR;
+				if (sleep_rem != 0) {
+					ret = -EINTR;
+					goto post_unlock;
+				}
 
-				i = mutex_lock_interruptible(&dev->bus_mutex);
-				if (i != 0)
-					return i;
+				ret = mutex_lock_interruptible(&dev->bus_mutex);
+				if (ret != 0)
+					goto post_unlock;
 			} else if (!w1_strong_pullup) {
 				sleep_rem = msleep_interruptible(tm);
 				if (sleep_rem != 0) {
-					mutex_unlock(&dev->bus_mutex);
-					return -EINTR;
+					ret = -EINTR;
+					goto pre_unlock;
 				}
 			}
 
@@ -269,19 +304,24 @@ static ssize_t w1_slave_show(struct device *device,
 	c -= snprintf(buf + PAGE_SIZE - c, c, ": crc=%02x %s\n",
 			   crc, (verdict) ? "YES" : "NO");
 	if (verdict)
-		memcpy(sl->family_data, rom, sizeof(rom));
+		memcpy(family_data, rom, sizeof(rom));
 	else
 		dev_warn(device, "Read failed CRC check\n");
 
 	for (i = 0; i < 9; ++i)
 		c -= snprintf(buf + PAGE_SIZE - c, c, "%02x ",
-			      ((u8 *)sl->family_data)[i]);
+			      ((u8 *)family_data)[i]);
 
 	c -= snprintf(buf + PAGE_SIZE - c, c, "t=%d\n",
 		w1_convert_temp(rom, sl->family->fid));
+	ret = PAGE_SIZE - c;
+
+pre_unlock:
 	mutex_unlock(&dev->bus_mutex);
 
-	return PAGE_SIZE - c;
+post_unlock:
+	atomic_dec(THERM_REFCNT(family_data));
+	return ret;
 }
 
 static int __init w1_therm_init(void)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-18  4:20                           ` David Fries
@ 2015-03-18 15:18                             ` Evgeniy Polyakov
  2015-03-19  0:09                               ` David Fries
  0 siblings, 1 reply; 24+ messages in thread
From: Evgeniy Polyakov @ 2015-03-18 15:18 UTC (permalink / raw)
  To: David Fries; +Cc: Thorsten Bschorr, Jonathan ALIBERT, linux-kernel

Hi

18.03.2015, 07:20, "David Fries" <david@fries.net>:
>  static void w1_therm_remove_slave(struct w1_slave *sl)
>  {
> + int refcnt = atomic_sub_return(1, THERM_REFCNT(sl->family_data));
> + while(refcnt) {
> + msleep(1000);
> + refcnt = atomic_read(THERM_REFCNT(sl->family_data));
> + }
>          kfree(sl->family_data);
>          sl->family_data = NULL;
>  }

Can we replace this whole atomic manipulations with kref_t and free family data in the place
which actually drops reference counter to zero?

I.e. we return from remove_slave() function potentially leaving family data floating around, it will be freed
when the last user drops the reference. There is still a race between increasing reference when starting
reading and removing slave device, i.e. one starts reading, while attached slave device is being removed,
but that's a different problem.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-03-18 15:18                             ` Evgeniy Polyakov
@ 2015-03-19  0:09                               ` David Fries
       [not found]                                 ` <CADq11+-tZmQEVUL=sHC64i4auC_5i=+y2yBcMTaJMdD5Z0dE6w@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: David Fries @ 2015-03-19  0:09 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Thorsten Bschorr, Jonathan ALIBERT, linux-kernel

On Wed, Mar 18, 2015 at 06:18:53PM +0300, Evgeniy Polyakov wrote:
> Hi
> 
> 18.03.2015, 07:20, "David Fries" <david@fries.net>:
> >  static void w1_therm_remove_slave(struct w1_slave *sl)
> >  {
> > + int refcnt = atomic_sub_return(1, THERM_REFCNT(sl->family_data));
> > + while(refcnt) {
> > + msleep(1000);
> > + refcnt = atomic_read(THERM_REFCNT(sl->family_data));
> > + }
> >          kfree(sl->family_data);
> >          sl->family_data = NULL;
> >  }
> 
> Can we replace this whole atomic manipulations with kref_t and free family data in the place
> which actually drops reference counter to zero?
> 
> I.e. we return from remove_slave() function potentially leaving family data floating around, it will be freed
> when the last user drops the reference. There is still a race between increasing reference when starting
> reading and removing slave device, i.e. one starts reading, while attached slave device is being removed,
> but that's a different problem.

With the two while loops I posted, I see with two clients reading
w1_slave, the other command to remove a slave gets permanently stuck
in w1_therm_remove_slave, which keeps the slave around while the
clients continue to read.  I wouldn't predict things going better by
keeping family_data around longer, the slave data would still go away
with readers around.

-- 
David Fries <david@fries.net>    PGP pub CB1EE8F0
http://fries.net/~david/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
       [not found]                                 ` <CADq11+-tZmQEVUL=sHC64i4auC_5i=+y2yBcMTaJMdD5Z0dE6w@mail.gmail.com>
@ 2015-04-16  3:51                                   ` David Fries
  2015-04-16 11:57                                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 24+ messages in thread
From: David Fries @ 2015-04-16  3:51 UTC (permalink / raw)
  To: Jonathan ALIBERT; +Cc: Evgeniy Polyakov, Thorsten Bschorr, linux-kernel

It has not been solved.  Evgeniy would like to make use of the sysfs
device management instead of the current reference counting, however I
haven't heard any volunteers to do that work.  I posted a quick fix
patch, it was very easy to crash without this patch, it doesn't
completely solve the race conditions, and I don't think it can be
solved in just a slave driver change.

Are you up for the challenge?

On Wed, Apr 15, 2015 at 09:52:27AM +0200, Jonathan ALIBERT wrote:
> Hi,
> 
> Do you know if the problem has been solved ?
> 
> Cheers,
> 
> *Jonathan ALIBERT*
> *06 32 26 59 12*
> *265, route de Saint Haon*
> *42 370 RENAISON*
> 
> 
> 2015-03-19 1:09 GMT+01:00 David Fries <david@fries.net>:
> 
> > On Wed, Mar 18, 2015 at 06:18:53PM +0300, Evgeniy Polyakov wrote:
> > > Hi
> > >
> > > 18.03.2015, 07:20, "David Fries" <david@fries.net>:
> > > >  static void w1_therm_remove_slave(struct w1_slave *sl)
> > > >  {
> > > > + int refcnt = atomic_sub_return(1, THERM_REFCNT(sl->family_data));
> > > > + while(refcnt) {
> > > > + msleep(1000);
> > > > + refcnt = atomic_read(THERM_REFCNT(sl->family_data));
> > > > + }
> > > >          kfree(sl->family_data);
> > > >          sl->family_data = NULL;
> > > >  }
> > >
> > > Can we replace this whole atomic manipulations with kref_t and free
> > family data in the place
> > > which actually drops reference counter to zero?
> > >
> > > I.e. we return from remove_slave() function potentially leaving family
> > data floating around, it will be freed
> > > when the last user drops the reference. There is still a race between
> > increasing reference when starting
> > > reading and removing slave device, i.e. one starts reading, while
> > attached slave device is being removed,
> > > but that's a different problem.
> >
> > With the two while loops I posted, I see with two clients reading
> > w1_slave, the other command to remove a slave gets permanently stuck
> > in w1_therm_remove_slave, which keeps the slave around while the
> > clients continue to read.  I wouldn't predict things going better by
> > keeping family_data around longer, the slave data would still go away
> > with readers around.
> >
> > --
> > David Fries <david@fries.net>    PGP pub CB1EE8F0
> > http://fries.net/~david/
> >

-- 
David Fries <david@fries.net>    PGP pub CB1EE8F0
http://fries.net/~david/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
  2015-04-16  3:51                                   ` David Fries
@ 2015-04-16 11:57                                     ` Evgeniy Polyakov
       [not found]                                       ` <CA+1k2cFw+2NTOtbSaJ1S=kBAn2Mj62DTeZo68V9t1Wk-7m7GyA@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Evgeniy Polyakov @ 2015-04-16 11:57 UTC (permalink / raw)
  To: David Fries, Jonathan ALIBERT; +Cc: Thorsten Bschorr, linux-kernel

Hi David

16.04.2015, 06:52, "David Fries" <david@fries.net>:
> It has not been solved.  Evgeniy would like to make use of the sysfs
> device management instead of the current reference counting, however I
> haven't heard any volunteers to do that work.  I posted a quick fix
> patch, it was very easy to crash without this patch, it doesn't
> completely solve the race conditions, and I don't think it can be
> solved in just a slave driver change.

Let's push this patch upstream as a temporal fix until we are ready with the new solution.
Thorsten, does it fix your crash?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Avoid null-pointer access in w1/slaves/w1_therm
       [not found]                                       ` <CA+1k2cFw+2NTOtbSaJ1S=kBAn2Mj62DTeZo68V9t1Wk-7m7GyA@mail.gmail.com>
@ 2015-04-17 12:55                                         ` Evgeniy Polyakov
  0 siblings, 0 replies; 24+ messages in thread
From: Evgeniy Polyakov @ 2015-04-17 12:55 UTC (permalink / raw)
  To: Thorsten Bschorr; +Cc: David Fries, Jonathan ALIBERT, linux-kernel

Hi

16.04.2015, 15:00, "Thorsten Bschorr" <thorsten@bschorr.de>:
>> Let's push this patch upstream as a temporal fix until we are ready with the new solution.
>> Thorsten, does it fix your crash?
>
> I'm running David's refcounting-patch for about 3 weeks now on my 3.18.y kernel, and did not observe any problems with it, especially no crash or dead-lock.

David, please send a formal patch to linux-kernel@ and Greg Kroah-Hartman
with signed-of and acked-by lines

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-04-17 13:03 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CA+1k2cH5Y+RngvEbgE3CW5g+bO3V1ytDJC=u6DLVDZvbEOhu5A@mail.gmail.com>
     [not found] ` <CA+1k2cF1frcEUu-L_cSJxTp=GKExn6Vt2rdCeY=zrhM62FUggw@mail.gmail.com>
     [not found]   ` <CA+1k2cEwtY+U7TS3rpmMp5nEBokO8vwhcOiD0ExVQnuoB=XVLQ@mail.gmail.com>
2015-02-23 17:09     ` Fwd: w1/slaves/w1_therm: null-ptr access of sl->family_data Thorsten Bschorr
2015-02-24  1:37       ` David Fries
2015-02-25  9:28         ` Thorsten Bschorr
2015-02-27  8:43 ` [PATCH] Avoid null-pointer access in w1/slaves/w1_therm Thorsten.Bschorr
2015-02-28 20:17   ` David Fries
     [not found]     ` <369891425174502@web4m.yandex.ru>
2015-03-01  2:17       ` David Fries
2015-03-01 13:04         ` Thorsten Bschorr
2015-03-02  0:17           ` David Fries
2015-03-04 15:36             ` Евгений Поляков
2015-03-08 21:14               ` David Fries
2015-03-09 22:47                 ` Thorsten Bschorr
2015-03-09 23:09                   ` David Fries
2015-03-10  0:05                     ` Thorsten Bschorr
2015-03-10  0:34                       ` Thorsten Bschorr
2015-03-12  0:44                         ` David Fries
2015-03-10 13:52                     ` Evgeniy Polyakov
2015-03-12  0:54                       ` David Fries
2015-03-14 20:55                         ` Evgeniy Polyakov
2015-03-18  4:20                           ` David Fries
2015-03-18 15:18                             ` Evgeniy Polyakov
2015-03-19  0:09                               ` David Fries
     [not found]                                 ` <CADq11+-tZmQEVUL=sHC64i4auC_5i=+y2yBcMTaJMdD5Z0dE6w@mail.gmail.com>
2015-04-16  3:51                                   ` David Fries
2015-04-16 11:57                                     ` Evgeniy Polyakov
     [not found]                                       ` <CA+1k2cFw+2NTOtbSaJ1S=kBAn2Mj62DTeZo68V9t1Wk-7m7GyA@mail.gmail.com>
2015-04-17 12:55                                         ` Evgeniy Polyakov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.