rcu.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
       [not found] <20191231110945.10857-1-yuyufen@huawei.com>
@ 2019-12-31 14:55 ` Hou Tao
  2019-12-31 23:11   ` Paul E. McKenney
                     ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Hou Tao @ 2019-12-31 14:55 UTC (permalink / raw)
  To: Yufen Yu, axboe
  Cc: linux-block, ming.lei, hch, zhengchuan, yi.zhang, paulmck, joel, rcu

Hi,

On 2019/12/31 19:09, Yufen Yu wrote:
> When delete partition executes concurrently with IOs issue,
> it may cause use-after-free on part in disk_map_sector_rcu()
> as following:
snip

> 
> diff --git a/block/genhd.c b/block/genhd.c
> index ff6268970ddc..39fa8999905f 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -293,7 +293,23 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>  		part = rcu_dereference(ptbl->part[i]);
>  
>  		if (part && sector_in_part(part, sector)) {
snip

>  			rcu_assign_pointer(ptbl->last_lookup, part);
> +			part = rcu_dereference(ptbl->part[i]);
> +			if (part == NULL) {
> +				rcu_assign_pointer(ptbl->last_lookup, NULL);
> +				break;
> +			}
>  			return part;
>  		}
>  	}

Not ensure whether the re-read can handle the following case or not:

process A                                 process B                          process C

disk_map_sector_rcu():                    delete_partition():               disk_map_sector_rcu():

rcu_read_lock

  // need to iterate partition table
  part[i] != NULL   (1)                   part[i] = NULL (2)
                                          smp_mb()
                                          last_lookup = NULL (3)
                                          call_rcu()  (4)
    last_lookup = part[i] (5)


                                                                             rcu_read_lock()
                                                                             read last_lookup return part[i] (6)
                                                                             sector_in_part() is OK (7)
                                                                             return part[i] (8)

  part[i] == NULL (9)
      last_lookup = NULL (10)
  rcu_read_unlock() (11)
                                           one RCU grace period completes
                                           __delete_partition() (12)
                                           free hd_partition (13)
                                                                             // use-after-free
                                                                             hd_struct_try_get(part[i])  (14)

* the number in the parenthesis is the sequence of events.

Maybe RCU experts can shed some light on this problem, so cc +paulmck@kernel.org, +joel@joelfernandes.org and +RCU maillist.

If the above case is possible, maybe we can fix the problem by pinning last_lookup through increasing its ref-count
(the following patch is only compile tested):

diff --git a/block/genhd.c b/block/genhd.c
index 6e8543ca6912..179e0056fae1 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -279,7 +279,14 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
 		part = rcu_dereference(ptbl->part[i]);

 		if (part && sector_in_part(part, sector)) {
-			rcu_assign_pointer(ptbl->last_lookup, part);
+			struct hd_struct *old;
+
+			if (!hd_struct_try_get(part))
+				break;
+
+			old = xchg(&ptbl->last_lookup, part);
+			if (old)
+				hd_struct_put(old);
 			return part;
 		}
 	}
@@ -1231,7 +1238,11 @@ static void disk_replace_part_tbl(struct gendisk *disk,
 	rcu_assign_pointer(disk->part_tbl, new_ptbl);

 	if (old_ptbl) {
-		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
+		struct hd_struct *part;
+
+		part = xchg(&old_ptbl->last_lookup, NULL);
+		if (part)
+			hd_struct_put(part);
 		kfree_rcu(old_ptbl, rcu_head);
 	}
 }
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 98d60a59b843..441c1c591c04 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -285,7 +285,8 @@ void delete_partition(struct gendisk *disk, int partno)
 		return;

 	rcu_assign_pointer(ptbl->part[partno], NULL);
-	rcu_assign_pointer(ptbl->last_lookup, NULL);
+	if (cmpxchg(&ptbl->last_lookup, part, NULL) == part)
+		hd_struct_put(part);
 	kobject_put(part->holder_dir);
 	device_del(part_to_dev(part));

-- 
2.22.0

Regards,
Tao


> diff --git a/block/partition-generic.c b/block/partition-generic.c
> index 1d20c9cf213f..1e0065ed6f02 100644
> --- a/block/partition-generic.c
> +++ b/block/partition-generic.c
> @@ -284,6 +284,13 @@ void delete_partition(struct gendisk *disk, int partno)
>  		return;
>  
>  	rcu_assign_pointer(ptbl->part[partno], NULL);
> +	/*
> +	 * Without the memory barrier, disk_map_sector_rcu()
> +	 * may read the old value after overwriting the
> +	 * last_lookup. Then it can not clear last_lookup,
> +	 * which may cause use-after-free.
> +	 */
> +	smp_mb();
>  	rcu_assign_pointer(ptbl->last_lookup, NULL);
>  	kobject_put(part->holder_dir);
>  	device_del(part_to_dev(part));
> 


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2019-12-31 14:55 ` [PATCH] block: make sure last_lookup set as NULL after part deleted Hou Tao
@ 2019-12-31 23:11   ` Paul E. McKenney
  2020-01-01  2:33     ` htbegin
  2020-01-03 23:45     ` Joel Fernandes
  2020-01-02  1:23   ` Ming Lei
  2020-01-03 12:43   ` Yufen Yu
  2 siblings, 2 replies; 20+ messages in thread
From: Paul E. McKenney @ 2019-12-31 23:11 UTC (permalink / raw)
  To: Hou Tao
  Cc: Yufen Yu, axboe, linux-block, ming.lei, hch, zhengchuan,
	yi.zhang, joel, rcu

On Tue, Dec 31, 2019 at 10:55:47PM +0800, Hou Tao wrote:
> Hi,
> 
> On 2019/12/31 19:09, Yufen Yu wrote:
> > When delete partition executes concurrently with IOs issue,
> > it may cause use-after-free on part in disk_map_sector_rcu()
> > as following:
> snip
> 
> > 
> > diff --git a/block/genhd.c b/block/genhd.c
> > index ff6268970ddc..39fa8999905f 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -293,7 +293,23 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> >  		part = rcu_dereference(ptbl->part[i]);
> >  
> >  		if (part && sector_in_part(part, sector)) {
> snip
> 
> >  			rcu_assign_pointer(ptbl->last_lookup, part);
> > +			part = rcu_dereference(ptbl->part[i]);
> > +			if (part == NULL) {
> > +				rcu_assign_pointer(ptbl->last_lookup, NULL);
> > +				break;
> > +			}
> >  			return part;
> >  		}
> >  	}
> 
> Not ensure whether the re-read can handle the following case or not:
> 
> process A                                 process B                          process C
> 
> disk_map_sector_rcu():                    delete_partition():               disk_map_sector_rcu():
> 
> rcu_read_lock
> 
>   // need to iterate partition table
>   part[i] != NULL   (1)                   part[i] = NULL (2)
>                                           smp_mb()
>                                           last_lookup = NULL (3)
>                                           call_rcu()  (4)
>     last_lookup = part[i] (5)
> 
> 
>                                                                              rcu_read_lock()
>                                                                              read last_lookup return part[i] (6)
>                                                                              sector_in_part() is OK (7)
>                                                                              return part[i] (8)

Just for the record...

Use of RCU needs to ensure that readers cannot access the to-be-freed
structure -before- invoking call_rcu().  Which does look to happen here
with the "last_lookup = NULL".  But in addition, the callback needs to
get access to the to-be-freed structure via some sideband (usually the
structure passed to call_rcu()), not from the reader-accessible structure.

Or am I misinterpreting this sequence of events?

							Thanx, Paul

>   part[i] == NULL (9)
>       last_lookup = NULL (10)
>   rcu_read_unlock() (11)
>                                            one RCU grace period completes
>                                            __delete_partition() (12)
>                                            free hd_partition (13)
>                                                                              // use-after-free
>                                                                              hd_struct_try_get(part[i])  (14)
> 
> * the number in the parenthesis is the sequence of events.
> 
> Maybe RCU experts can shed some light on this problem, so cc +paulmck@kernel.org, +joel@joelfernandes.org and +RCU maillist.
> 
> If the above case is possible, maybe we can fix the problem by pinning last_lookup through increasing its ref-count
> (the following patch is only compile tested):
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 6e8543ca6912..179e0056fae1 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -279,7 +279,14 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>  		part = rcu_dereference(ptbl->part[i]);
> 
>  		if (part && sector_in_part(part, sector)) {
> -			rcu_assign_pointer(ptbl->last_lookup, part);
> +			struct hd_struct *old;
> +
> +			if (!hd_struct_try_get(part))
> +				break;
> +
> +			old = xchg(&ptbl->last_lookup, part);
> +			if (old)
> +				hd_struct_put(old);
>  			return part;
>  		}
>  	}
> @@ -1231,7 +1238,11 @@ static void disk_replace_part_tbl(struct gendisk *disk,
>  	rcu_assign_pointer(disk->part_tbl, new_ptbl);
> 
>  	if (old_ptbl) {
> -		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
> +		struct hd_struct *part;
> +
> +		part = xchg(&old_ptbl->last_lookup, NULL);
> +		if (part)
> +			hd_struct_put(part);
>  		kfree_rcu(old_ptbl, rcu_head);
>  	}
>  }
> diff --git a/block/partition-generic.c b/block/partition-generic.c
> index 98d60a59b843..441c1c591c04 100644
> --- a/block/partition-generic.c
> +++ b/block/partition-generic.c
> @@ -285,7 +285,8 @@ void delete_partition(struct gendisk *disk, int partno)
>  		return;
> 
>  	rcu_assign_pointer(ptbl->part[partno], NULL);
> -	rcu_assign_pointer(ptbl->last_lookup, NULL);
> +	if (cmpxchg(&ptbl->last_lookup, part, NULL) == part)
> +		hd_struct_put(part);
>  	kobject_put(part->holder_dir);
>  	device_del(part_to_dev(part));
> 
> -- 
> 2.22.0
> 
> Regards,
> Tao
> 
> 
> > diff --git a/block/partition-generic.c b/block/partition-generic.c
> > index 1d20c9cf213f..1e0065ed6f02 100644
> > --- a/block/partition-generic.c
> > +++ b/block/partition-generic.c
> > @@ -284,6 +284,13 @@ void delete_partition(struct gendisk *disk, int partno)
> >  		return;
> >  
> >  	rcu_assign_pointer(ptbl->part[partno], NULL);
> > +	/*
> > +	 * Without the memory barrier, disk_map_sector_rcu()
> > +	 * may read the old value after overwriting the
> > +	 * last_lookup. Then it can not clear last_lookup,
> > +	 * which may cause use-after-free.
> > +	 */
> > +	smp_mb();
> >  	rcu_assign_pointer(ptbl->last_lookup, NULL);
> >  	kobject_put(part->holder_dir);
> >  	device_del(part_to_dev(part));
> > 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2019-12-31 23:11   ` Paul E. McKenney
@ 2020-01-01  2:33     ` htbegin
  2020-01-01  3:39       ` htbegin
  2020-01-03 23:45     ` Joel Fernandes
  1 sibling, 1 reply; 20+ messages in thread
From: htbegin @ 2020-01-01  2:33 UTC (permalink / raw)
  To: paulmck
  Cc: Hou Tao, Yufen Yu, axboe, linux-block, ming.lei, hch, zhengchuan,
	yi.zhang, joel, rcu

Hi Paul,

Paul E. McKenney <paulmck@kernel.org> 于2020年1月1日周三 上午7:13写道:
>
> On Tue, Dec 31, 2019 at 10:55:47PM +0800, Hou Tao wrote:
> > Hi,
> >
> > On 2019/12/31 19:09, Yufen Yu wrote:
> > > When delete partition executes concurrently with IOs issue,
> > > it may cause use-after-free on part in disk_map_sector_rcu()
> > > as following:
> > snip
> >
> > >
> > > diff --git a/block/genhd.c b/block/genhd.c
> > > index ff6268970ddc..39fa8999905f 100644
> > > --- a/block/genhd.c
> > > +++ b/block/genhd.c
> > > @@ -293,7 +293,23 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> > >             part = rcu_dereference(ptbl->part[i]);
> > >
> > >             if (part && sector_in_part(part, sector)) {
> > snip
> >
> > >                     rcu_assign_pointer(ptbl->last_lookup, part);
> > > +                   part = rcu_dereference(ptbl->part[i]);
> > > +                   if (part == NULL) {
> > > +                           rcu_assign_pointer(ptbl->last_lookup, NULL);
> > > +                           break;
> > > +                   }
> > >                     return part;
> > >             }
> > >     }
> >
> > Not ensure whether the re-read can handle the following case or not:
> >
> > process A                                 process B                          process C
> >
> > disk_map_sector_rcu():                    delete_partition():               disk_map_sector_rcu():
> >
> > rcu_read_lock
> >
> >   // need to iterate partition table
> >   part[i] != NULL   (1)                   part[i] = NULL (2)
> >                                           smp_mb()
> >                                           last_lookup = NULL (3)
> >                                           call_rcu()  (4)
> >     last_lookup = part[i] (5)
> >
> >
> >                                                                              rcu_read_lock()
> >                                                                              read last_lookup return part[i] (6)
> >                                                                              sector_in_part() is OK (7)
> >                                                                              return part[i] (8)
>
> Just for the record...
>
> Use of RCU needs to ensure that readers cannot access the to-be-freed
> structure -before- invoking call_rcu().  Which does look to happen here
> with the "last_lookup = NULL".

But the last_lookup pointer may be reassigned a to-be-freed pointer (namely
part[i]) as show in process A, and I think the RCU read section starts after
the call of call_rcu() may read the reassigned pointer and use it, right ?

> But in addition, the callback needs to
> get access to the to-be-freed structure via some sideband (usually the
> structure passed to call_rcu()), not from the reader-accessible structure.
>
Er, could you clarify the meaning of "reader-accessible structure" ? Do you
mean "reader-writable structure" like last_lookup ? In the case, the callback
does access the to-be-freed struct by the embedded "rcu_work".

> Or am I misinterpreting this sequence of events?
>
Ah, I over simplify the procedures happened in process B. Does the following
diagram make it clearer ?

process A                           process B                     process C

disk_map_sector_rcu():              delete_partition():
disk_map_sector_rcu():

rcu_read_lock

  // iterate part table
  part[i] != NULL   (1)
                                    deleted = part[i]
                                    part[i] = NULL (2)
                                    smp_mb()
                                    last_lookup = NULL (3)

                                    percpu_ref_kill(deleted)
                                      // last ref
                                      __delete_partition (4)
                                        queue_rcu_work
    // re-assign last_lookup
    last_lookup = part[i] (5)



rcu_read_lock()
                                                                  //
hit last_lookup
                                                                  read
last_lookup returns part[i] (6)

sector_in_part() is OK (7)
                                                                  will
return part[i] (8)

rcu_read_unlock()

  // found part[i] is NULL
  // reassign last_lookup
  last_lookup = NULL
  will return part0

  rcu_read_unlock
                                    // RCU callback
                                    delete_partition_work_fn
                                      free deleted
                                                                   //
use-after-free ?

hd_struct_try_get
Regards,
Tao


>                                                         Thanx, Paul
>
> >   part[i] == NULL (9)
> >       last_lookup = NULL (10)
> >   rcu_read_unlock() (11)
> >                                            one RCU grace period completes
> >                                            __delete_partition() (12)
> >                                            free hd_partition (13)
> >                                                                              // use-after-free
> >                                                                              hd_struct_try_get(part[i])  (14)
> >
> > * the number in the parenthesis is the sequence of events.
> >
> > Maybe RCU experts can shed some light on this problem, so cc +paulmck@kernel.org, +joel@joelfernandes.org and +RCU maillist.
> >
> > If the above case is possible, maybe we can fix the problem by pinning last_lookup through increasing its ref-count
> > (the following patch is only compile tested):
> >
> > diff --git a/block/genhd.c b/block/genhd.c
> > index 6e8543ca6912..179e0056fae1 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -279,7 +279,14 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> >               part = rcu_dereference(ptbl->part[i]);
> >
> >               if (part && sector_in_part(part, sector)) {
> > -                     rcu_assign_pointer(ptbl->last_lookup, part);
> > +                     struct hd_struct *old;
> > +
> > +                     if (!hd_struct_try_get(part))
> > +                             break;
> > +
> > +                     old = xchg(&ptbl->last_lookup, part);
> > +                     if (old)
> > +                             hd_struct_put(old);
> >                       return part;
> >               }
> >       }
> > @@ -1231,7 +1238,11 @@ static void disk_replace_part_tbl(struct gendisk *disk,
> >       rcu_assign_pointer(disk->part_tbl, new_ptbl);
> >
> >       if (old_ptbl) {
> > -             rcu_assign_pointer(old_ptbl->last_lookup, NULL);
> > +             struct hd_struct *part;
> > +
> > +             part = xchg(&old_ptbl->last_lookup, NULL);
> > +             if (part)
> > +                     hd_struct_put(part);
> >               kfree_rcu(old_ptbl, rcu_head);
> >       }
> >  }
> > diff --git a/block/partition-generic.c b/block/partition-generic.c
> > index 98d60a59b843..441c1c591c04 100644
> > --- a/block/partition-generic.c
> > +++ b/block/partition-generic.c
> > @@ -285,7 +285,8 @@ void delete_partition(struct gendisk *disk, int partno)
> >               return;
> >
> >       rcu_assign_pointer(ptbl->part[partno], NULL);
> > -     rcu_assign_pointer(ptbl->last_lookup, NULL);
> > +     if (cmpxchg(&ptbl->last_lookup, part, NULL) == part)
> > +             hd_struct_put(part);
> >       kobject_put(part->holder_dir);
> >       device_del(part_to_dev(part));
> >
> > --
> > 2.22.0
> >
> > Regards,
> > Tao
> >
> >
> > > diff --git a/block/partition-generic.c b/block/partition-generic.c
> > > index 1d20c9cf213f..1e0065ed6f02 100644
> > > --- a/block/partition-generic.c
> > > +++ b/block/partition-generic.c
> > > @@ -284,6 +284,13 @@ void delete_partition(struct gendisk *disk, int partno)
> > >             return;
> > >
> > >     rcu_assign_pointer(ptbl->part[partno], NULL);
> > > +   /*
> > > +    * Without the memory barrier, disk_map_sector_rcu()
> > > +    * may read the old value after overwriting the
> > > +    * last_lookup. Then it can not clear last_lookup,
> > > +    * which may cause use-after-free.
> > > +    */
> > > +   smp_mb();
> > >     rcu_assign_pointer(ptbl->last_lookup, NULL);
> > >     kobject_put(part->holder_dir);
> > >     device_del(part_to_dev(part));
> > >
> >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-01  2:33     ` htbegin
@ 2020-01-01  3:39       ` htbegin
  0 siblings, 0 replies; 20+ messages in thread
From: htbegin @ 2020-01-01  3:39 UTC (permalink / raw)
  To: paulmck
  Cc: Hou Tao, Yufen Yu, axboe, linux-block, ming.lei, hch, zhengchuan,
	yi.zhang, joel, rcu

Sorry, it seems that i miss-formatted the diagram, so I resend it here:

process A                    process B            process C

disk_map_sector_rcu():       delete_partition():  disk_map_sector_rcu():

rcu_read_lock

  // iterate part table
  part[i] != NULL  (1)

                             deleted = part[i]
                             part[i] = NULL (2)
                             smp_mb()
                             last_lookup = NULL (3)

                             percpu_ref_kill(deleted)
                               // last ref
                               __delete_partition (4)
                                 queue_rcu_work

  // re-assign last_lookup
  last_lookup = part[i] (5)


                                                  rcu_read_lock()
                                                  // is it possible ?
                                                  last_lookup is part[i] (6)
                                                  sector_in_part() is OK (7)
                                                  will return part[i] (8)
                                                  rcu_read_unlock()

  // found part[i] is NULL
  // reassign last_lookup
  last_lookup = NULL
  will return part0
  rcu_read_unlock

                               // RCU callback
                               delete_partition_work_fn
                                 free deleted
                                                       // use-after-free ?
                                                       hd_struct_try_get

On Wed, Jan 1, 2020 at 10:33 AM htbegin <hotforest@gmail.com> wrote:
>
> Hi Paul,
>
> Paul E. McKenney <paulmck@kernel.org> 于2020年1月1日周三 上午7:13写道:
> >
> > On Tue, Dec 31, 2019 at 10:55:47PM +0800, Hou Tao wrote:
> > > Hi,
> > >
> > > On 2019/12/31 19:09, Yufen Yu wrote:
> > > > When delete partition executes concurrently with IOs issue,
> > > > it may cause use-after-free on part in disk_map_sector_rcu()
> > > > as following:
> > > snip
> > >
> > > >
> > > > diff --git a/block/genhd.c b/block/genhd.c
> > > > index ff6268970ddc..39fa8999905f 100644
> > > > --- a/block/genhd.c
> > > > +++ b/block/genhd.c
> > > > @@ -293,7 +293,23 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> > > >             part = rcu_dereference(ptbl->part[i]);
> > > >
> > > >             if (part && sector_in_part(part, sector)) {
> > > snip
> > >
> > > >                     rcu_assign_pointer(ptbl->last_lookup, part);
> > > > +                   part = rcu_dereference(ptbl->part[i]);
> > > > +                   if (part == NULL) {
> > > > +                           rcu_assign_pointer(ptbl->last_lookup, NULL);
> > > > +                           break;
> > > > +                   }
> > > >                     return part;
> > > >             }
> > > >     }
> > >
> > > Not ensure whether the re-read can handle the following case or not:
> > >
> > > process A                                 process B                          process C
> > >
> > > disk_map_sector_rcu():                    delete_partition():               disk_map_sector_rcu():
> > >
> > > rcu_read_lock
> > >
> > >   // need to iterate partition table
> > >   part[i] != NULL   (1)                   part[i] = NULL (2)
> > >                                           smp_mb()
> > >                                           last_lookup = NULL (3)
> > >                                           call_rcu()  (4)
> > >     last_lookup = part[i] (5)
> > >
> > >
> > >                                                                              rcu_read_lock()
> > >                                                                              read last_lookup return part[i] (6)
> > >                                                                              sector_in_part() is OK (7)
> > >                                                                              return part[i] (8)
> >
> > Just for the record...
> >
> > Use of RCU needs to ensure that readers cannot access the to-be-freed
> > structure -before- invoking call_rcu().  Which does look to happen here
> > with the "last_lookup = NULL".
>
> But the last_lookup pointer may be reassigned a to-be-freed pointer (namely
> part[i]) as show in process A, and I think the RCU read section starts after
> the call of call_rcu() may read the reassigned pointer and use it, right ?
>
> > But in addition, the callback needs to
> > get access to the to-be-freed structure via some sideband (usually the
> > structure passed to call_rcu()), not from the reader-accessible structure.
> >
> Er, could you clarify the meaning of "reader-accessible structure" ? Do you
> mean "reader-writable structure" like last_lookup ? In the case, the callback
> does access the to-be-freed struct by the embedded "rcu_work".
>
> > Or am I misinterpreting this sequence of events?
> >
> Ah, I over simplify the procedures happened in process B. Does the following
> diagram make it clearer ?
>
> process A                           process B                     process C
>
> disk_map_sector_rcu():              delete_partition():
> disk_map_sector_rcu():
>
> rcu_read_lock
>
>   // iterate part table
>   part[i] != NULL   (1)
>                                     deleted = part[i]
>                                     part[i] = NULL (2)
>                                     smp_mb()
>                                     last_lookup = NULL (3)
>
>                                     percpu_ref_kill(deleted)
>                                       // last ref
>                                       __delete_partition (4)
>                                         queue_rcu_work
>     // re-assign last_lookup
>     last_lookup = part[i] (5)
>
>
>
> rcu_read_lock()
>                                                                   //
> hit last_lookup
>                                                                   read
> last_lookup returns part[i] (6)
>
> sector_in_part() is OK (7)
>                                                                   will
> return part[i] (8)
>
> rcu_read_unlock()
>
>   // found part[i] is NULL
>   // reassign last_lookup
>   last_lookup = NULL
>   will return part0
>
>   rcu_read_unlock
>                                     // RCU callback
>                                     delete_partition_work_fn
>                                       free deleted
>                                                                    //
> use-after-free ?
>
> hd_struct_try_get
> Regards,
> Tao
>
>
> >                                                         Thanx, Paul
> >
> > >   part[i] == NULL (9)
> > >       last_lookup = NULL (10)
> > >   rcu_read_unlock() (11)
> > >                                            one RCU grace period completes
> > >                                            __delete_partition() (12)
> > >                                            free hd_partition (13)
> > >                                                                              // use-after-free
> > >                                                                              hd_struct_try_get(part[i])  (14)
> > >
> > > * the number in the parenthesis is the sequence of events.
> > >
> > > Maybe RCU experts can shed some light on this problem, so cc +paulmck@kernel.org, +joel@joelfernandes.org and +RCU maillist.
> > >
> > > If the above case is possible, maybe we can fix the problem by pinning last_lookup through increasing its ref-count
> > > (the following patch is only compile tested):
> > >
> > > diff --git a/block/genhd.c b/block/genhd.c
> > > index 6e8543ca6912..179e0056fae1 100644
> > > --- a/block/genhd.c
> > > +++ b/block/genhd.c
> > > @@ -279,7 +279,14 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> > >               part = rcu_dereference(ptbl->part[i]);
> > >
> > >               if (part && sector_in_part(part, sector)) {
> > > -                     rcu_assign_pointer(ptbl->last_lookup, part);
> > > +                     struct hd_struct *old;
> > > +
> > > +                     if (!hd_struct_try_get(part))
> > > +                             break;
> > > +
> > > +                     old = xchg(&ptbl->last_lookup, part);
> > > +                     if (old)
> > > +                             hd_struct_put(old);
> > >                       return part;
> > >               }
> > >       }
> > > @@ -1231,7 +1238,11 @@ static void disk_replace_part_tbl(struct gendisk *disk,
> > >       rcu_assign_pointer(disk->part_tbl, new_ptbl);
> > >
> > >       if (old_ptbl) {
> > > -             rcu_assign_pointer(old_ptbl->last_lookup, NULL);
> > > +             struct hd_struct *part;
> > > +
> > > +             part = xchg(&old_ptbl->last_lookup, NULL);
> > > +             if (part)
> > > +                     hd_struct_put(part);
> > >               kfree_rcu(old_ptbl, rcu_head);
> > >       }
> > >  }
> > > diff --git a/block/partition-generic.c b/block/partition-generic.c
> > > index 98d60a59b843..441c1c591c04 100644
> > > --- a/block/partition-generic.c
> > > +++ b/block/partition-generic.c
> > > @@ -285,7 +285,8 @@ void delete_partition(struct gendisk *disk, int partno)
> > >               return;
> > >
> > >       rcu_assign_pointer(ptbl->part[partno], NULL);
> > > -     rcu_assign_pointer(ptbl->last_lookup, NULL);
> > > +     if (cmpxchg(&ptbl->last_lookup, part, NULL) == part)
> > > +             hd_struct_put(part);
> > >       kobject_put(part->holder_dir);
> > >       device_del(part_to_dev(part));
> > >
> > > --
> > > 2.22.0
> > >
> > > Regards,
> > > Tao
> > >
> > >
> > > > diff --git a/block/partition-generic.c b/block/partition-generic.c
> > > > index 1d20c9cf213f..1e0065ed6f02 100644
> > > > --- a/block/partition-generic.c
> > > > +++ b/block/partition-generic.c
> > > > @@ -284,6 +284,13 @@ void delete_partition(struct gendisk *disk, int partno)
> > > >             return;
> > > >
> > > >     rcu_assign_pointer(ptbl->part[partno], NULL);
> > > > +   /*
> > > > +    * Without the memory barrier, disk_map_sector_rcu()
> > > > +    * may read the old value after overwriting the
> > > > +    * last_lookup. Then it can not clear last_lookup,
> > > > +    * which may cause use-after-free.
> > > > +    */
> > > > +   smp_mb();
> > > >     rcu_assign_pointer(ptbl->last_lookup, NULL);
> > > >     kobject_put(part->holder_dir);
> > > >     device_del(part_to_dev(part));
> > > >
> > >

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2019-12-31 14:55 ` [PATCH] block: make sure last_lookup set as NULL after part deleted Hou Tao
  2019-12-31 23:11   ` Paul E. McKenney
@ 2020-01-02  1:23   ` Ming Lei
  2020-01-03  3:06     ` Hou Tao
  2020-01-03 12:43   ` Yufen Yu
  2 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2020-01-02  1:23 UTC (permalink / raw)
  To: Hou Tao
  Cc: Yufen Yu, axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck,
	joel, rcu

On Tue, Dec 31, 2019 at 10:55:47PM +0800, Hou Tao wrote:
> Hi,
> 
> On 2019/12/31 19:09, Yufen Yu wrote:
> > When delete partition executes concurrently with IOs issue,
> > it may cause use-after-free on part in disk_map_sector_rcu()
> > as following:
> snip
> 
> > 
> > diff --git a/block/genhd.c b/block/genhd.c
> > index ff6268970ddc..39fa8999905f 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -293,7 +293,23 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> >  		part = rcu_dereference(ptbl->part[i]);
> >  
> >  		if (part && sector_in_part(part, sector)) {
> snip
> 
> >  			rcu_assign_pointer(ptbl->last_lookup, part);
> > +			part = rcu_dereference(ptbl->part[i]);
> > +			if (part == NULL) {
> > +				rcu_assign_pointer(ptbl->last_lookup, NULL);
> > +				break;
> > +			}
> >  			return part;
> >  		}
> >  	}
> 
> Not ensure whether the re-read can handle the following case or not:
> 
> process A                                 process B                          process C
> 
> disk_map_sector_rcu():                    delete_partition():               disk_map_sector_rcu():
> 
> rcu_read_lock
> 
>   // need to iterate partition table
>   part[i] != NULL   (1)                   part[i] = NULL (2)
>                                           smp_mb()
>                                           last_lookup = NULL (3)
>                                           call_rcu()  (4)
>     last_lookup = part[i] (5)
> 
> 
>                                                                              rcu_read_lock()
>                                                                              read last_lookup return part[i] (6)
>                                                                              sector_in_part() is OK (7)
>                                                                              return part[i] (8)
> 
>   part[i] == NULL (9)
>       last_lookup = NULL (10)
>   rcu_read_unlock() (11)
>                                            one RCU grace period completes
>                                            __delete_partition() (12)
>                                            free hd_partition (13)
>                                                                              // use-after-free
>                                                                              hd_struct_try_get(part[i])  (14)
> 
> * the number in the parenthesis is the sequence of events.
> 
> Maybe RCU experts can shed some light on this problem, so cc +paulmck@kernel.org, +joel@joelfernandes.org and +RCU maillist.
> 
> If the above case is possible, maybe we can fix the problem by pinning last_lookup through increasing its ref-count
> (the following patch is only compile tested):
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 6e8543ca6912..179e0056fae1 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -279,7 +279,14 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>  		part = rcu_dereference(ptbl->part[i]);
> 
>  		if (part && sector_in_part(part, sector)) {
> -			rcu_assign_pointer(ptbl->last_lookup, part);
> +			struct hd_struct *old;
> +
> +			if (!hd_struct_try_get(part))
> +				break;
> +
> +			old = xchg(&ptbl->last_lookup, part);
> +			if (old)
> +				hd_struct_put(old);
>  			return part;
>  		}
>  	}
> @@ -1231,7 +1238,11 @@ static void disk_replace_part_tbl(struct gendisk *disk,
>  	rcu_assign_pointer(disk->part_tbl, new_ptbl);
> 
>  	if (old_ptbl) {
> -		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
> +		struct hd_struct *part;
> +
> +		part = xchg(&old_ptbl->last_lookup, NULL);
> +		if (part)
> +			hd_struct_put(part);
>  		kfree_rcu(old_ptbl, rcu_head);
>  	}
>  }
> diff --git a/block/partition-generic.c b/block/partition-generic.c
> index 98d60a59b843..441c1c591c04 100644
> --- a/block/partition-generic.c
> +++ b/block/partition-generic.c
> @@ -285,7 +285,8 @@ void delete_partition(struct gendisk *disk, int partno)
>  		return;
> 
>  	rcu_assign_pointer(ptbl->part[partno], NULL);
> -	rcu_assign_pointer(ptbl->last_lookup, NULL);
> +	if (cmpxchg(&ptbl->last_lookup, part, NULL) == part)
> +		hd_struct_put(part);
>  	kobject_put(part->holder_dir);
>  	device_del(part_to_dev(part));

IMO this approach looks good.

Given partition is actually protected by percpu-refcount now, I guess the
RCU annotation for referencing ->part[partno] and ->last_lookup may not
be necessary, together with the part->rcu_work.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-02  1:23   ` Ming Lei
@ 2020-01-03  3:06     ` Hou Tao
  2020-01-03  4:18       ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Hou Tao @ 2020-01-03  3:06 UTC (permalink / raw)
  To: Ming Lei
  Cc: Yufen Yu, axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck,
	joel, rcu

Hi,

On 2020/1/2 9:23, Ming Lei wrote:
> On Tue, Dec 31, 2019 at 10:55:47PM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 2019/12/31 19:09, Yufen Yu wrote:
>>> When delete partition executes concurrently with IOs issue,
>>> it may cause use-after-free on part in disk_map_sector_rcu()
>>> as following:
>> snip
>>
>>>
>>> diff --git a/block/genhd.c b/block/genhd.c
>>> index ff6268970ddc..39fa8999905f 100644
>>> --- a/block/genhd.c
>>> +++ b/block/genhd.c
>>> @@ -293,7 +293,23 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>>>  		part = rcu_dereference(ptbl->part[i]);
>>>  
>>>  		if (part && sector_in_part(part, sector)) {
>> snip
>>
>>>  			rcu_assign_pointer(ptbl->last_lookup, part);
>>> +			part = rcu_dereference(ptbl->part[i]);
>>> +			if (part == NULL) {
>>> +				rcu_assign_pointer(ptbl->last_lookup, NULL);
>>> +				break;
>>> +			}
>>>  			return part;
>>>  		}
>>>  	}
>>
>> Not ensure whether the re-read can handle the following case or not:
>>
We have written a similar test case for the following case and found out that
process C still may got the freed hd_struct pointer from process A. So
the re-read will not resolve the problem.

>> process A                                 process B                          process C
>>
>> disk_map_sector_rcu():                    delete_partition():               disk_map_sector_rcu():
>>
>> rcu_read_lock
>>
>>   // need to iterate partition table
>>   part[i] != NULL   (1)                   part[i] = NULL (2)
>>                                           smp_mb()
>>                                           last_lookup = NULL (3)
>>                                           call_rcu()  (4)
>>     last_lookup = part[i] (5)
>>
>>
>>                                                                              rcu_read_lock()
>>                                                                              read last_lookup return part[i] (6)
>>                                                                              sector_in_part() is OK (7)
>>                                                                              return part[i] (8)
>>
>>   part[i] == NULL (9)
>>       last_lookup = NULL (10)
>>   rcu_read_unlock() (11)
>>                                            one RCU grace period completes
>>                                            __delete_partition() (12)
>>                                            free hd_partition (13)
>>                                                                              // use-after-free
>>                                                                              hd_struct_try_get(part[i])  (14)
>>
>> * the number in the parenthesis is the sequence of events.
>>



>> Maybe RCU experts can shed some light on this problem, so cc +paulmck@kernel.org, +joel@joelfernandes.org and +RCU maillist.
>>
>> If the above case is possible, maybe we can fix the problem by pinning last_lookup through increasing its ref-count
>> (the following patch is only compile tested):
>>
>> diff --git a/block/genhd.c b/block/genhd.c
>> index 6e8543ca6912..179e0056fae1 100644
>> --- a/block/genhd.c
>> +++ b/block/genhd.c
>> @@ -279,7 +279,14 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>>  		part = rcu_dereference(ptbl->part[i]);
>>
>>  		if (part && sector_in_part(part, sector)) {
>> -			rcu_assign_pointer(ptbl->last_lookup, part);
>> +			struct hd_struct *old;
>> +
>> +			if (!hd_struct_try_get(part))
>> +				break;
>> +
>> +			old = xchg(&ptbl->last_lookup, part);
>> +			if (old)
>> +				hd_struct_put(old);
>>  			return part;
>>  		}
>>  	}
>> @@ -1231,7 +1238,11 @@ static void disk_replace_part_tbl(struct gendisk *disk,
>>  	rcu_assign_pointer(disk->part_tbl, new_ptbl);
>>
>>  	if (old_ptbl) {
>> -		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
>> +		struct hd_struct *part;
>> +
>> +		part = xchg(&old_ptbl->last_lookup, NULL);
>> +		if (part)
>> +			hd_struct_put(part);
>>  		kfree_rcu(old_ptbl, rcu_head);
>>  	}
>>  }
>> diff --git a/block/partition-generic.c b/block/partition-generic.c
>> index 98d60a59b843..441c1c591c04 100644
>> --- a/block/partition-generic.c
>> +++ b/block/partition-generic.c
>> @@ -285,7 +285,8 @@ void delete_partition(struct gendisk *disk, int partno)
>>  		return;
>>
>>  	rcu_assign_pointer(ptbl->part[partno], NULL);
>> -	rcu_assign_pointer(ptbl->last_lookup, NULL);
>> +	if (cmpxchg(&ptbl->last_lookup, part, NULL) == part)
>> +		hd_struct_put(part);
>>  	kobject_put(part->holder_dir);
>>  	device_del(part_to_dev(part));
> 
> IMO this approach looks good.
>
Not sure about the overhead when there are concurrent IOs on different partitions,
we will measure that.

We have got a seemingly better solution: caching the index of last_lookup in tbl->part[]
instead of caching the pointer itself, so we can ensure the validity of returned pointer
by ensuring it's not NULL in tbl->part[] as does when last_lookup is NULL or 0.

> Given partition is actually protected by percpu-refcount now, I guess the
> RCU annotation for referencing ->part[partno] and ->last_lookup may not
> be necessary, together with the part->rcu_work.
> 
So we will depends on the invocation of of call_rcu() on __percpu_ref_switch_mode() to
ensure the RCU readers will find part[i] is NULL before trying to increasing
the atomic ref-counter of part[i], right ?

Regards,
Tao

> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-03  3:06     ` Hou Tao
@ 2020-01-03  4:18       ` Ming Lei
  2020-01-03  7:35         ` Hou Tao
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2020-01-03  4:18 UTC (permalink / raw)
  To: Hou Tao
  Cc: Yufen Yu, axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck,
	joel, rcu

On Fri, Jan 03, 2020 at 11:06:25AM +0800, Hou Tao wrote:
> Hi,
> 
> On 2020/1/2 9:23, Ming Lei wrote:
> > On Tue, Dec 31, 2019 at 10:55:47PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 2019/12/31 19:09, Yufen Yu wrote:
> >>> When delete partition executes concurrently with IOs issue,
> >>> it may cause use-after-free on part in disk_map_sector_rcu()
> >>> as following:
> >> snip
> >>
> >>>
> >>> diff --git a/block/genhd.c b/block/genhd.c
> >>> index ff6268970ddc..39fa8999905f 100644
> >>> --- a/block/genhd.c
> >>> +++ b/block/genhd.c
> >>> @@ -293,7 +293,23 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> >>>  		part = rcu_dereference(ptbl->part[i]);
> >>>  
> >>>  		if (part && sector_in_part(part, sector)) {
> >> snip
> >>
> >>>  			rcu_assign_pointer(ptbl->last_lookup, part);
> >>> +			part = rcu_dereference(ptbl->part[i]);
> >>> +			if (part == NULL) {
> >>> +				rcu_assign_pointer(ptbl->last_lookup, NULL);
> >>> +				break;
> >>> +			}
> >>>  			return part;
> >>>  		}
> >>>  	}
> >>
> >> Not ensure whether the re-read can handle the following case or not:
> >>
> We have written a similar test case for the following case and found out that
> process C still may got the freed hd_struct pointer from process A. So
> the re-read will not resolve the problem.
> 
> >> process A                                 process B                          process C
> >>
> >> disk_map_sector_rcu():                    delete_partition():               disk_map_sector_rcu():
> >>
> >> rcu_read_lock
> >>
> >>   // need to iterate partition table
> >>   part[i] != NULL   (1)                   part[i] = NULL (2)
> >>                                           smp_mb()
> >>                                           last_lookup = NULL (3)
> >>                                           call_rcu()  (4)
> >>     last_lookup = part[i] (5)
> >>
> >>
> >>                                                                              rcu_read_lock()
> >>                                                                              read last_lookup return part[i] (6)
> >>                                                                              sector_in_part() is OK (7)
> >>                                                                              return part[i] (8)
> >>
> >>   part[i] == NULL (9)
> >>       last_lookup = NULL (10)
> >>   rcu_read_unlock() (11)
> >>                                            one RCU grace period completes
> >>                                            __delete_partition() (12)
> >>                                            free hd_partition (13)
> >>                                                                              // use-after-free
> >>                                                                              hd_struct_try_get(part[i])  (14)
> >>
> >> * the number in the parenthesis is the sequence of events.
> >>
> 
> 
> 
> >> Maybe RCU experts can shed some light on this problem, so cc +paulmck@kernel.org, +joel@joelfernandes.org and +RCU maillist.
> >>
> >> If the above case is possible, maybe we can fix the problem by pinning last_lookup through increasing its ref-count
> >> (the following patch is only compile tested):
> >>
> >> diff --git a/block/genhd.c b/block/genhd.c
> >> index 6e8543ca6912..179e0056fae1 100644
> >> --- a/block/genhd.c
> >> +++ b/block/genhd.c
> >> @@ -279,7 +279,14 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> >>  		part = rcu_dereference(ptbl->part[i]);
> >>
> >>  		if (part && sector_in_part(part, sector)) {
> >> -			rcu_assign_pointer(ptbl->last_lookup, part);
> >> +			struct hd_struct *old;
> >> +
> >> +			if (!hd_struct_try_get(part))
> >> +				break;
> >> +
> >> +			old = xchg(&ptbl->last_lookup, part);
> >> +			if (old)
> >> +				hd_struct_put(old);
> >>  			return part;
> >>  		}
> >>  	}
> >> @@ -1231,7 +1238,11 @@ static void disk_replace_part_tbl(struct gendisk *disk,
> >>  	rcu_assign_pointer(disk->part_tbl, new_ptbl);
> >>
> >>  	if (old_ptbl) {
> >> -		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
> >> +		struct hd_struct *part;
> >> +
> >> +		part = xchg(&old_ptbl->last_lookup, NULL);
> >> +		if (part)
> >> +			hd_struct_put(part);
> >>  		kfree_rcu(old_ptbl, rcu_head);
> >>  	}
> >>  }
> >> diff --git a/block/partition-generic.c b/block/partition-generic.c
> >> index 98d60a59b843..441c1c591c04 100644
> >> --- a/block/partition-generic.c
> >> +++ b/block/partition-generic.c
> >> @@ -285,7 +285,8 @@ void delete_partition(struct gendisk *disk, int partno)
> >>  		return;
> >>
> >>  	rcu_assign_pointer(ptbl->part[partno], NULL);
> >> -	rcu_assign_pointer(ptbl->last_lookup, NULL);
> >> +	if (cmpxchg(&ptbl->last_lookup, part, NULL) == part)
> >> +		hd_struct_put(part);
> >>  	kobject_put(part->holder_dir);
> >>  	device_del(part_to_dev(part));
> > 
> > IMO this approach looks good.
> >
> Not sure about the overhead when there are concurrent IOs on different partitions,
> we will measure that.
> 
> We have got a seemingly better solution: caching the index of last_lookup in tbl->part[]
> instead of caching the pointer itself, so we can ensure the validity of returned pointer
> by ensuring it's not NULL in tbl->part[] as does when last_lookup is NULL or 0.

Thinking of the problem further, looks we don't need to hold ref for
.last_lookup.

What we need is to make sure the partition's ref is increased just
before assigning .last_lookup, so how about something like the following?

diff --git a/block/blk-core.c b/block/blk-core.c
index 089e890ab208..79599f5fd5b7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1365,18 +1365,6 @@ void blk_account_io_start(struct request *rq, bool new_io)
 		part_stat_inc(part, merges[rw]);
 	} else {
 		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
-		if (!hd_struct_try_get(part)) {
-			/*
-			 * The partition is already being removed,
-			 * the request will be accounted on the disk only
-			 *
-			 * We take a reference on disk->part0 although that
-			 * partition will never be deleted, so we can treat
-			 * it as any other partition.
-			 */
-			part = &rq->rq_disk->part0;
-			hd_struct_get(part);
-		}
 		part_inc_in_flight(rq->q, part, rw);
 		rq->part = part;
 	}
diff --git a/block/genhd.c b/block/genhd.c
index ff6268970ddc..21f4a9b8d24d 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -286,17 +286,24 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
 	ptbl = rcu_dereference(disk->part_tbl);
 
 	part = rcu_dereference(ptbl->last_lookup);
-	if (part && sector_in_part(part, sector))
+	if (part && sector_in_part(part, sector)) {
+		if (!hd_struct_try_get(part))
+			goto exit;
 		return part;
+	}
 
 	for (i = 1; i < ptbl->len; i++) {
 		part = rcu_dereference(ptbl->part[i]);
 
 		if (part && sector_in_part(part, sector)) {
+                       if (!hd_struct_try_get(part))
+                               goto exit;
 			rcu_assign_pointer(ptbl->last_lookup, part);
 			return part;
 		}
 	}
+ exit:
+	hd_struct_get(&disk->part0);
 	return &disk->part0;
 }
 EXPORT_SYMBOL_GPL(disk_map_sector_rcu);


> 
> > Given partition is actually protected by percpu-refcount now, I guess the
> > RCU annotation for referencing ->part[partno] and ->last_lookup may not
> > be necessary, together with the part->rcu_work.
> > 
> So we will depends on the invocation of of call_rcu() on __percpu_ref_switch_mode() to
> ensure the RCU readers will find part[i] is NULL before trying to increasing
> the atomic ref-counter of part[i], right ?

Yeah.

Thanks,
Ming


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-03  4:18       ` Ming Lei
@ 2020-01-03  7:35         ` Hou Tao
  2020-01-03  8:17           ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Hou Tao @ 2020-01-03  7:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Yufen Yu, axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck,
	joel, rcu

Hi,

On 2020/1/3 12:18, Ming Lei wrote:
> On Fri, Jan 03, 2020 at 11:06:25AM +0800, Hou Tao wrote:
>> Hi,
>>
>> On 2020/1/2 9:23, Ming Lei wrote:
>>> On Tue, Dec 31, 2019 at 10:55:47PM +0800, Hou Tao wrote:
>>>> Hi,
>>>>
snip

>> We have got a seemingly better solution: caching the index of last_lookup in tbl->part[]
>> instead of caching the pointer itself, so we can ensure the validity of returned pointer
>> by ensuring it's not NULL in tbl->part[] as does when last_lookup is NULL or 0.
> 
> Thinking of the problem further, looks we don't need to hold ref for
> .last_lookup.
> 
> What we need is to make sure the partition's ref is increased just
> before assigning .last_lookup, so how about something like the following?
> 
The approach will work for the above case, but it will not work for the following case:

when blk_account_io_done() releases the last ref-counter of last_lookup and calls call_rcu(),
and then a RCU read gets the to-be-freed hd-struct.

blk_account_io_done
  rcu_read_lock()
  // the last ref of last_lookup
  hd_struct_put()
    call_rcu

                              rcu_read_lock
                              read last_lookup

    free()
                              // use-after-free ?
                              hd_struct_try_get

Regards,
Tao

> diff --git a/block/blk-core.c b/block/blk-core.c
> index 089e890ab208..79599f5fd5b7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1365,18 +1365,6 @@ void blk_account_io_start(struct request *rq, bool new_io)
>  		part_stat_inc(part, merges[rw]);
>  	} else {
>  		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> -		if (!hd_struct_try_get(part)) {
> -			/*
> -			 * The partition is already being removed,
> -			 * the request will be accounted on the disk only
> -			 *
> -			 * We take a reference on disk->part0 although that
> -			 * partition will never be deleted, so we can treat
> -			 * it as any other partition.
> -			 */
> -			part = &rq->rq_disk->part0;
> -			hd_struct_get(part);
> -		}
>  		part_inc_in_flight(rq->q, part, rw);
>  		rq->part = part;
>  	}
> diff --git a/block/genhd.c b/block/genhd.c
> index ff6268970ddc..21f4a9b8d24d 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -286,17 +286,24 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>  	ptbl = rcu_dereference(disk->part_tbl);
>  
>  	part = rcu_dereference(ptbl->last_lookup);
> -	if (part && sector_in_part(part, sector))
> +	if (part && sector_in_part(part, sector)) {
> +		if (!hd_struct_try_get(part))
> +			goto exit;
>  		return part;
> +	}
>  
>  	for (i = 1; i < ptbl->len; i++) {
>  		part = rcu_dereference(ptbl->part[i]);
>  
>  		if (part && sector_in_part(part, sector)) {
> +                       if (!hd_struct_try_get(part))
> +                               goto exit;
>  			rcu_assign_pointer(ptbl->last_lookup, part);
>  			return part;
>  		}
>  	}
> + exit:
> +	hd_struct_get(&disk->part0);
>  	return &disk->part0;
>  }
>  EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
> 
> 
>>
>>> Given partition is actually protected by percpu-refcount now, I guess the
>>> RCU annotation for referencing ->part[partno] and ->last_lookup may not
>>> be necessary, together with the part->rcu_work.
>>>
>> So we will depends on the invocation of of call_rcu() on __percpu_ref_switch_mode() to
>> ensure the RCU readers will find part[i] is NULL before trying to increasing
>> the atomic ref-counter of part[i], right ?
> 
> Yeah.
> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-03  7:35         ` Hou Tao
@ 2020-01-03  8:17           ` Ming Lei
  2020-01-03 12:03             ` Yufen Yu
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2020-01-03  8:17 UTC (permalink / raw)
  To: Hou Tao
  Cc: Yufen Yu, axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck,
	joel, rcu

On Fri, Jan 03, 2020 at 03:35:42PM +0800, Hou Tao wrote:
> Hi,
> 
> On 2020/1/3 12:18, Ming Lei wrote:
> > On Fri, Jan 03, 2020 at 11:06:25AM +0800, Hou Tao wrote:
> >> Hi,
> >>
> >> On 2020/1/2 9:23, Ming Lei wrote:
> >>> On Tue, Dec 31, 2019 at 10:55:47PM +0800, Hou Tao wrote:
> >>>> Hi,
> >>>>
> snip
> 
> >> We have got a seemingly better solution: caching the index of last_lookup in tbl->part[]
> >> instead of caching the pointer itself, so we can ensure the validity of returned pointer
> >> by ensuring it's not NULL in tbl->part[] as does when last_lookup is NULL or 0.
> > 
> > Thinking of the problem further, looks we don't need to hold ref for
> > .last_lookup.
> > 
> > What we need is to make sure the partition's ref is increased just
> > before assigning .last_lookup, so how about something like the following?
> > 
> The approach will work for the above case, but it will not work for the following case:
> 
> when blk_account_io_done() releases the last ref-counter of last_lookup and calls call_rcu(),
> and then a RCU read gets the to-be-freed hd-struct.
> 
> blk_account_io_done
>   rcu_read_lock()
>   // the last ref of last_lookup
>   hd_struct_put()
>     call_rcu
> 
>                               rcu_read_lock
>                               read last_lookup
> 
>     free()
>                               // use-after-free ?
>                               hd_struct_try_get

We may avoid that by clearing partition pointer after killing the
partition, how about the following change?

diff --git a/block/blk-core.c b/block/blk-core.c
index 089e890ab208..79599f5fd5b7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1365,18 +1365,6 @@ void blk_account_io_start(struct request *rq, bool new_io)
 		part_stat_inc(part, merges[rw]);
 	} else {
 		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
-		if (!hd_struct_try_get(part)) {
-			/*
-			 * The partition is already being removed,
-			 * the request will be accounted on the disk only
-			 *
-			 * We take a reference on disk->part0 although that
-			 * partition will never be deleted, so we can treat
-			 * it as any other partition.
-			 */
-			part = &rq->rq_disk->part0;
-			hd_struct_get(part);
-		}
 		part_inc_in_flight(rq->q, part, rw);
 		rq->part = part;
 	}
diff --git a/block/genhd.c b/block/genhd.c
index ff6268970ddc..e3dec90b1f43 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -286,17 +286,21 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
 	ptbl = rcu_dereference(disk->part_tbl);
 
 	part = rcu_dereference(ptbl->last_lookup);
-	if (part && sector_in_part(part, sector))
+	if (part && sector_in_part(part, sector) && hd_struct_try_get(part))
 		return part;
 
 	for (i = 1; i < ptbl->len; i++) {
 		part = rcu_dereference(ptbl->part[i]);
 
 		if (part && sector_in_part(part, sector)) {
+                       if (!hd_struct_try_get(part))
+                               goto exit;
 			rcu_assign_pointer(ptbl->last_lookup, part);
 			return part;
 		}
 	}
+ exit:
+	hd_struct_get(&disk->part0);
 	return &disk->part0;
 }
 EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 1d20c9cf213f..9ef6c13d5650 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -283,8 +283,8 @@ void delete_partition(struct gendisk *disk, int partno)
 	if (!part)
 		return;
 
-	rcu_assign_pointer(ptbl->part[partno], NULL);
-	rcu_assign_pointer(ptbl->last_lookup, NULL);
+	get_device(disk_to_dev(disk));
+
 	kobject_put(part->holder_dir);
 	device_del(part_to_dev(part));
 
@@ -296,6 +296,15 @@ void delete_partition(struct gendisk *disk, int partno)
 	 */
 	blk_invalidate_devt(part_devt(part));
 	hd_struct_kill(part);
+
+	/*
+	 * clear partition pointers after this partition is killed, then
+	 * IO path can't re-assign ->last_lookup any more
+	 */
+	rcu_assign_pointer(ptbl->part[partno], NULL);
+	rcu_assign_pointer(ptbl->last_lookup, NULL);
+
+	put_device(disk_to_dev(disk));
 }
 
 static ssize_t whole_disk_show(struct device *dev,

Thanks,
Ming


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-03  8:17           ` Ming Lei
@ 2020-01-03 12:03             ` Yufen Yu
  2020-01-03 15:16               ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Yufen Yu @ 2020-01-03 12:03 UTC (permalink / raw)
  To: Ming Lei, Hou Tao
  Cc: axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck, joel, rcu

Hi, Ming

On 2020/1/3 16:17, Ming Lei wrote:
> 
> We may avoid that by clearing partition pointer after killing the
> partition, how about the following change?
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 089e890ab208..79599f5fd5b7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1365,18 +1365,6 @@ void blk_account_io_start(struct request *rq, bool new_io)
>   		part_stat_inc(part, merges[rw]);
>   	} else {
>   		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> -		if (!hd_struct_try_get(part)) {
> -			/*
> -			 * The partition is already being removed,
> -			 * the request will be accounted on the disk only
> -			 *
> -			 * We take a reference on disk->part0 although that
> -			 * partition will never be deleted, so we can treat
> -			 * it as any other partition.
> -			 */
> -			part = &rq->rq_disk->part0;
> -			hd_struct_get(part);
> -		}
>   		part_inc_in_flight(rq->q, part, rw);
>   		rq->part = part;
>   	}
> diff --git a/block/genhd.c b/block/genhd.c
> index ff6268970ddc..e3dec90b1f43 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -286,17 +286,21 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>   	ptbl = rcu_dereference(disk->part_tbl);
>   
>   	part = rcu_dereference(ptbl->last_lookup);
> -	if (part && sector_in_part(part, sector))
> +	if (part && sector_in_part(part, sector) && hd_struct_try_get(part))
>   		return part;
>   
>   	for (i = 1; i < ptbl->len; i++) {
>   		part = rcu_dereference(ptbl->part[i]);
>   
>   		if (part && sector_in_part(part, sector)) {
> +                       if (!hd_struct_try_get(part))
> +                               goto exit;
>   			rcu_assign_pointer(ptbl->last_lookup, part);
>   			return part;
>   		}
>   	}
> + exit:
> +	hd_struct_get(&disk->part0);
>   	return &disk->part0;
>   }
>   EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
> diff --git a/block/partition-generic.c b/block/partition-generic.c
> index 1d20c9cf213f..9ef6c13d5650 100644
> --- a/block/partition-generic.c
> +++ b/block/partition-generic.c
> @@ -283,8 +283,8 @@ void delete_partition(struct gendisk *disk, int partno)
>   	if (!part)
>   		return;
>   
> -	rcu_assign_pointer(ptbl->part[partno], NULL);
> -	rcu_assign_pointer(ptbl->last_lookup, NULL);
> +	get_device(disk_to_dev(disk));
> +
>   	kobject_put(part->holder_dir);
>   	device_del(part_to_dev(part));
>   
> @@ -296,6 +296,15 @@ void delete_partition(struct gendisk *disk, int partno)
>   	 */
>   	blk_invalidate_devt(part_devt(part));
>   	hd_struct_kill(part);
> +
> +	/*
> +	 * clear partition pointers after this partition is killed, then
> +	 * IO path can't re-assign ->last_lookup any more
> +	 */
> +	rcu_assign_pointer(ptbl->part[partno], NULL);
> +	rcu_assign_pointer(ptbl->last_lookup, NULL);
> +
> +	put_device(disk_to_dev(disk));
>   }
>   

This change may cannot solve follow case:

disk_map_sector_rcu     delete_partition  disk_map_sector_rcu
hd_struct_try_get(part)
                         hd_struct_kill
                         last_lookup = NULL;
last_lookup = part


call_rcu
                                            read last_lookup

free()
                                            //use-after-free
                                            sector_in_part(part, sector)

There is an interval between getting part and setting last_lookup
in disk_map_sector_rcu(). If we kill the part and clear last_lookup
at that interval, last_lookup will be re-assign again, which can cause
use-after-free for readers after call_rcu.

Thanks
Yufen

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2019-12-31 14:55 ` [PATCH] block: make sure last_lookup set as NULL after part deleted Hou Tao
  2019-12-31 23:11   ` Paul E. McKenney
  2020-01-02  1:23   ` Ming Lei
@ 2020-01-03 12:43   ` Yufen Yu
  2 siblings, 0 replies; 20+ messages in thread
From: Yufen Yu @ 2020-01-03 12:43 UTC (permalink / raw)
  To: houtao (A), axboe
  Cc: linux-block, ming.lei, hch, zhengchuan, zhangyi (F), paulmck, joel, rcu



On 2019/12/31 22:55, houtao (A) wrote:

> diff --git a/block/genhd.c b/block/genhd.c
> index 6e8543ca6912..179e0056fae1 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -279,7 +279,14 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>   		part = rcu_dereference(ptbl->part[i]);
> 
>   		if (part && sector_in_part(part, sector)) {
> -			rcu_assign_pointer(ptbl->last_lookup, part);
> +			struct hd_struct *old;
> +
> +			if (!hd_struct_try_get(part))
> +				break;
> +
> +			old = xchg(&ptbl->last_lookup, part);
> +			if (old)
> +				hd_struct_put(old);
>   			return part;
>   		}
>   	}
> @@ -1231,7 +1238,11 @@ static void disk_replace_part_tbl(struct gendisk *disk,
>   	rcu_assign_pointer(disk->part_tbl, new_ptbl);
> 
>   	if (old_ptbl) {
> -		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
> +		struct hd_struct *part;
> +
> +		part = xchg(&old_ptbl->last_lookup, NULL);
> +		if (part)
> +			hd_struct_put(part);
>   		kfree_rcu(old_ptbl, rcu_head);
>   	}
>   }
> diff --git a/block/partition-generic.c b/block/partition-generic.c
> index 98d60a59b843..441c1c591c04 100644
> --- a/block/partition-generic.c
> +++ b/block/partition-generic.c
> @@ -285,7 +285,8 @@ void delete_partition(struct gendisk *disk, int partno)
>   		return;
> 
>   	rcu_assign_pointer(ptbl->part[partno], NULL);
> -	rcu_assign_pointer(ptbl->last_lookup, NULL);
> +	if (cmpxchg(&ptbl->last_lookup, part, NULL) == part)
> +		hd_struct_put(part);
>   	kobject_put(part->holder_dir);
>   	device_del(part_to_dev(part));
> 
This change will add more calling of hd_struct_try_get()/hd_struct_put()
when set last_lookup. Thus, Tao warried abort extra overhead.

We test the patch on a SSD disk by fio with config:
[global]
rw=randread
bs=4k
numjobs=2
ioengine=libaio
iodepth=32
direct=1
runtime=120
size=100G
group_reporting
time_based

[sda1]
filename=/dev/sda1

[sda2]
filename=/dev/sda2

[sda3]
filename=/dev/sda3

[sda4]
filename=/dev/sda4

Compared with the previous version, io bandwidth have no any degeneration.

Thanks,
Yufen






^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-03 12:03             ` Yufen Yu
@ 2020-01-03 15:16               ` Ming Lei
  2020-01-06  7:39                 ` Yufen Yu
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2020-01-03 15:16 UTC (permalink / raw)
  To: Yufen Yu
  Cc: Hou Tao, axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck,
	joel, rcu

Hello Yufen,

On Fri, Jan 03, 2020 at 08:03:54PM +0800, Yufen Yu wrote:
> Hi, Ming
> 
> On 2020/1/3 16:17, Ming Lei wrote:
> > 
> > We may avoid that by clearing partition pointer after killing the
> > partition, how about the following change?
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 089e890ab208..79599f5fd5b7 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1365,18 +1365,6 @@ void blk_account_io_start(struct request *rq, bool new_io)
> >   		part_stat_inc(part, merges[rw]);
> >   	} else {
> >   		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> > -		if (!hd_struct_try_get(part)) {
> > -			/*
> > -			 * The partition is already being removed,
> > -			 * the request will be accounted on the disk only
> > -			 *
> > -			 * We take a reference on disk->part0 although that
> > -			 * partition will never be deleted, so we can treat
> > -			 * it as any other partition.
> > -			 */
> > -			part = &rq->rq_disk->part0;
> > -			hd_struct_get(part);
> > -		}
> >   		part_inc_in_flight(rq->q, part, rw);
> >   		rq->part = part;
> >   	}
> > diff --git a/block/genhd.c b/block/genhd.c
> > index ff6268970ddc..e3dec90b1f43 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -286,17 +286,21 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> >   	ptbl = rcu_dereference(disk->part_tbl);
> >   	part = rcu_dereference(ptbl->last_lookup);
> > -	if (part && sector_in_part(part, sector))
> > +	if (part && sector_in_part(part, sector) && hd_struct_try_get(part))
> >   		return part;
> >   	for (i = 1; i < ptbl->len; i++) {
> >   		part = rcu_dereference(ptbl->part[i]);
> >   		if (part && sector_in_part(part, sector)) {
> > +                       if (!hd_struct_try_get(part))
> > +                               goto exit;
> >   			rcu_assign_pointer(ptbl->last_lookup, part);
> >   			return part;
> >   		}
> >   	}
> > + exit:
> > +	hd_struct_get(&disk->part0);
> >   	return &disk->part0;
> >   }
> >   EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
> > diff --git a/block/partition-generic.c b/block/partition-generic.c
> > index 1d20c9cf213f..9ef6c13d5650 100644
> > --- a/block/partition-generic.c
> > +++ b/block/partition-generic.c
> > @@ -283,8 +283,8 @@ void delete_partition(struct gendisk *disk, int partno)
> >   	if (!part)
> >   		return;
> > -	rcu_assign_pointer(ptbl->part[partno], NULL);
> > -	rcu_assign_pointer(ptbl->last_lookup, NULL);
> > +	get_device(disk_to_dev(disk));
> > +
> >   	kobject_put(part->holder_dir);
> >   	device_del(part_to_dev(part));
> > @@ -296,6 +296,15 @@ void delete_partition(struct gendisk *disk, int partno)
> >   	 */
> >   	blk_invalidate_devt(part_devt(part));
> >   	hd_struct_kill(part);
> > +
> > +	/*
> > +	 * clear partition pointers after this partition is killed, then
> > +	 * IO path can't re-assign ->last_lookup any more
> > +	 */
> > +	rcu_assign_pointer(ptbl->part[partno], NULL);
> > +	rcu_assign_pointer(ptbl->last_lookup, NULL);
> > +
> > +	put_device(disk_to_dev(disk));
> >   }
> 
> This change may cannot solve follow case:
> 
> disk_map_sector_rcu     delete_partition  disk_map_sector_rcu
> hd_struct_try_get(part)
>                         hd_struct_kill
>                         last_lookup = NULL;
> last_lookup = part
> 
> 
> call_rcu
>                                            read last_lookup
> 
> free()
>                                            //use-after-free
>                                            sector_in_part(part, sector)
> 
> There is an interval between getting part and setting last_lookup
> in disk_map_sector_rcu(). If we kill the part and clear last_lookup
> at that interval, last_lookup will be re-assign again, which can cause
> use-after-free for readers after call_rcu.

OK, we still can move clearing .last_lookup into __delete_partition(),
at that time all IO path can observe the partition percpu-refcount killed.

Also the rcu work fn is run after one RCU grace period, at that time,
the NULL .last_lookup becomes visible in all IO path too.

diff --git a/block/blk-core.c b/block/blk-core.c
index 089e890ab208..79599f5fd5b7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1365,18 +1365,6 @@ void blk_account_io_start(struct request *rq, bool new_io)
 		part_stat_inc(part, merges[rw]);
 	} else {
 		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
-		if (!hd_struct_try_get(part)) {
-			/*
-			 * The partition is already being removed,
-			 * the request will be accounted on the disk only
-			 *
-			 * We take a reference on disk->part0 although that
-			 * partition will never be deleted, so we can treat
-			 * it as any other partition.
-			 */
-			part = &rq->rq_disk->part0;
-			hd_struct_get(part);
-		}
 		part_inc_in_flight(rq->q, part, rw);
 		rq->part = part;
 	}
diff --git a/block/genhd.c b/block/genhd.c
index ff6268970ddc..e3dec90b1f43 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -286,17 +286,21 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
 	ptbl = rcu_dereference(disk->part_tbl);
 
 	part = rcu_dereference(ptbl->last_lookup);
-	if (part && sector_in_part(part, sector))
+	if (part && sector_in_part(part, sector) && hd_struct_try_get(part))
 		return part;
 
 	for (i = 1; i < ptbl->len; i++) {
 		part = rcu_dereference(ptbl->part[i]);
 
 		if (part && sector_in_part(part, sector)) {
+                       if (!hd_struct_try_get(part))
+                               goto exit;
 			rcu_assign_pointer(ptbl->last_lookup, part);
 			return part;
 		}
 	}
+ exit:
+	hd_struct_get(&disk->part0);
 	return &disk->part0;
 }
 EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
diff --git a/block/partition-generic.c b/block/partition-generic.c
index 1d20c9cf213f..1739f750dbf2 100644
--- a/block/partition-generic.c
+++ b/block/partition-generic.c
@@ -262,6 +262,12 @@ static void delete_partition_work_fn(struct work_struct *work)
 void __delete_partition(struct percpu_ref *ref)
 {
 	struct hd_struct *part = container_of(ref, struct hd_struct, ref);
+	struct disk_part_tbl *ptbl =
+		rcu_dereference_protected(part->disk->part_tbl, 1);
+
+	rcu_assign_pointer(ptbl->last_lookup, NULL);
+	put_device(disk_to_dev(part->disk));
+
 	INIT_RCU_WORK(&part->rcu_work, delete_partition_work_fn);
 	queue_rcu_work(system_wq, &part->rcu_work);
 }
@@ -283,8 +289,9 @@ void delete_partition(struct gendisk *disk, int partno)
 	if (!part)
 		return;
 
+	get_device(disk_to_dev(disk));
 	rcu_assign_pointer(ptbl->part[partno], NULL);
-	rcu_assign_pointer(ptbl->last_lookup, NULL);
+
 	kobject_put(part->holder_dir);
 	device_del(part_to_dev(part));
 
@@ -349,6 +356,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
 	p->nr_sects = len;
 	p->partno = partno;
 	p->policy = get_disk_ro(disk);
+	p->disk = disk;
 
 	if (info) {
 		struct partition_meta_info *pinfo = alloc_part_info(disk);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 8bb63027e4d6..66660ec5e8ee 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -129,6 +129,7 @@ struct hd_struct {
 #else
 	struct disk_stats dkstats;
 #endif
+	struct gendisk *disk;
 	struct percpu_ref ref;
 	struct rcu_work rcu_work;
 };


Thanks,
Ming


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2019-12-31 23:11   ` Paul E. McKenney
  2020-01-01  2:33     ` htbegin
@ 2020-01-03 23:45     ` Joel Fernandes
  2020-01-04  9:16       ` Hou Tao
  1 sibling, 1 reply; 20+ messages in thread
From: Joel Fernandes @ 2020-01-03 23:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Hou Tao, Yufen Yu, axboe, linux-block, ming.lei, hch, zhengchuan,
	yi.zhang, rcu

On Tue, Dec 31, 2019 at 03:11:58PM -0800, Paul E. McKenney wrote:
> On Tue, Dec 31, 2019 at 10:55:47PM +0800, Hou Tao wrote:
> > Hi,
> > 
> > On 2019/12/31 19:09, Yufen Yu wrote:
> > > When delete partition executes concurrently with IOs issue,
> > > it may cause use-after-free on part in disk_map_sector_rcu()
> > > as following:
> > snip
> > 
> > > 
> > > diff --git a/block/genhd.c b/block/genhd.c
> > > index ff6268970ddc..39fa8999905f 100644
> > > --- a/block/genhd.c
> > > +++ b/block/genhd.c
> > > @@ -293,7 +293,23 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> > >  		part = rcu_dereference(ptbl->part[i]);
> > >  
> > >  		if (part && sector_in_part(part, sector)) {
> > snip
> > 
> > >  			rcu_assign_pointer(ptbl->last_lookup, part);
> > > +			part = rcu_dereference(ptbl->part[i]);
> > > +			if (part == NULL) {
> > > +				rcu_assign_pointer(ptbl->last_lookup, NULL);
> > > +				break;
> > > +			}
> > >  			return part;
> > >  		}
> > >  	}
> > 
> > Not ensure whether the re-read can handle the following case or not:
> > 
> > process A                                 process B                          process C
> > 
> > disk_map_sector_rcu():                    delete_partition():               disk_map_sector_rcu():
> > 
> > rcu_read_lock
> > 
> >   // need to iterate partition table
> >   part[i] != NULL   (1)                   part[i] = NULL (2)
> >                                           smp_mb()
> >                                           last_lookup = NULL (3)
> >                                           call_rcu()  (4)
> >     last_lookup = part[i] (5)
> > 
> > 
> >                                                                              rcu_read_lock()
> >                                                                              read last_lookup return part[i] (6)
> >                                                                              sector_in_part() is OK (7)
> >                                                                              return part[i] (8)
> 
> Just for the record...
> 
> Use of RCU needs to ensure that readers cannot access the to-be-freed
> structure -before- invoking call_rcu().  Which does look to happen here
> with the "last_lookup = NULL".  But in addition, the callback needs to
> get access to the to-be-freed structure via some sideband (usually the
> structure passed to call_rcu()), not from the reader-accessible structure.
> 
> Or am I misinterpreting this sequence of events?

If I understand correctly, the issue described above is there are 2 threads
setting last_lookup pointer simultaneously, one of them is NULLing it and
waiting for a GP before freeing it (process B above), while the other is
assigning to it concurrently after it was just NULLed (process A). Meanwhile
process C starts a reader section *after* the GP by process B already started
and accesses the reassigned pointer causing use-after-free.

Did I miss something?

I believe the fix is what Tao already posted which is to use refcounts so
that the destructor does not free it while references are already held. Is
that what the final fix is going to be? That other thread is pretty long so I
lost track a bit..

thanks,

 - Joel



> 							Thanx, Paul
> 
> >   part[i] == NULL (9)
> >       last_lookup = NULL (10)
> >   rcu_read_unlock() (11)
> >                                            one RCU grace period completes
> >                                            __delete_partition() (12)
> >                                            free hd_partition (13)
> >                                                                              // use-after-free
> >                                                                              hd_struct_try_get(part[i])  (14)
> > 
> > * the number in the parenthesis is the sequence of events.
> > 
> > Maybe RCU experts can shed some light on this problem, so cc +paulmck@kernel.org, +joel@joelfernandes.org and +RCU maillist.
> > 
> > If the above case is possible, maybe we can fix the problem by pinning last_lookup through increasing its ref-count
> > (the following patch is only compile tested):
> > 
> > diff --git a/block/genhd.c b/block/genhd.c
> > index 6e8543ca6912..179e0056fae1 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -279,7 +279,14 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> >  		part = rcu_dereference(ptbl->part[i]);
> > 
> >  		if (part && sector_in_part(part, sector)) {
> > -			rcu_assign_pointer(ptbl->last_lookup, part);
> > +			struct hd_struct *old;
> > +
> > +			if (!hd_struct_try_get(part))
> > +				break;
> > +
> > +			old = xchg(&ptbl->last_lookup, part);
> > +			if (old)
> > +				hd_struct_put(old);
> >  			return part;
> >  		}
> >  	}
> > @@ -1231,7 +1238,11 @@ static void disk_replace_part_tbl(struct gendisk *disk,
> >  	rcu_assign_pointer(disk->part_tbl, new_ptbl);
> > 
> >  	if (old_ptbl) {
> > -		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
> > +		struct hd_struct *part;
> > +
> > +		part = xchg(&old_ptbl->last_lookup, NULL);
> > +		if (part)
> > +			hd_struct_put(part);
> >  		kfree_rcu(old_ptbl, rcu_head);
> >  	}
> >  }
> > diff --git a/block/partition-generic.c b/block/partition-generic.c
> > index 98d60a59b843..441c1c591c04 100644
> > --- a/block/partition-generic.c
> > +++ b/block/partition-generic.c
> > @@ -285,7 +285,8 @@ void delete_partition(struct gendisk *disk, int partno)
> >  		return;
> > 
> >  	rcu_assign_pointer(ptbl->part[partno], NULL);
> > -	rcu_assign_pointer(ptbl->last_lookup, NULL);
> > +	if (cmpxchg(&ptbl->last_lookup, part, NULL) == part)
> > +		hd_struct_put(part);
> >  	kobject_put(part->holder_dir);
> >  	device_del(part_to_dev(part));
> > 
> > -- 
> > 2.22.0
> > 
> > Regards,
> > Tao
> > 
> > 
> > > diff --git a/block/partition-generic.c b/block/partition-generic.c
> > > index 1d20c9cf213f..1e0065ed6f02 100644
> > > --- a/block/partition-generic.c
> > > +++ b/block/partition-generic.c
> > > @@ -284,6 +284,13 @@ void delete_partition(struct gendisk *disk, int partno)
> > >  		return;
> > >  
> > >  	rcu_assign_pointer(ptbl->part[partno], NULL);
> > > +	/*
> > > +	 * Without the memory barrier, disk_map_sector_rcu()
> > > +	 * may read the old value after overwriting the
> > > +	 * last_lookup. Then it can not clear last_lookup,
> > > +	 * which may cause use-after-free.
> > > +	 */
> > > +	smp_mb();
> > >  	rcu_assign_pointer(ptbl->last_lookup, NULL);
> > >  	kobject_put(part->holder_dir);
> > >  	device_del(part_to_dev(part));
> > > 
> > 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-03 23:45     ` Joel Fernandes
@ 2020-01-04  9:16       ` Hou Tao
  0 siblings, 0 replies; 20+ messages in thread
From: Hou Tao @ 2020-01-04  9:16 UTC (permalink / raw)
  To: Joel Fernandes, Paul E. McKenney
  Cc: Yufen Yu, axboe, linux-block, ming.lei, hch, zhengchuan, yi.zhang, rcu

Hi Joel,

On 2020/1/4 7:45, Joel Fernandes wrote:
> On Tue, Dec 31, 2019 at 03:11:58PM -0800, Paul E. McKenney wrote:
>> On Tue, Dec 31, 2019 at 10:55:47PM +0800, Hou Tao wrote:
>>> Hi,
>>>
>>> On 2019/12/31 19:09, Yufen Yu wrote:
>>>> When delete partition executes concurrently with IOs issue,
>>>> it may cause use-after-free on part in disk_map_sector_rcu()
>>>> as following:
>>> snip
>>>
>>>>
>>>> diff --git a/block/genhd.c b/block/genhd.c
>>>> index ff6268970ddc..39fa8999905f 100644
>>>> --- a/block/genhd.c
>>>> +++ b/block/genhd.c
>>>> @@ -293,7 +293,23 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>>>>  		part = rcu_dereference(ptbl->part[i]);
>>>>  
>>>>  		if (part && sector_in_part(part, sector)) {
>>> snip
>>>
>>>>  			rcu_assign_pointer(ptbl->last_lookup, part);
>>>> +			part = rcu_dereference(ptbl->part[i]);
>>>> +			if (part == NULL) {
>>>> +				rcu_assign_pointer(ptbl->last_lookup, NULL);
>>>> +				break;
>>>> +			}
>>>>  			return part;
>>>>  		}
>>>>  	}
>>>
>>> Not ensure whether the re-read can handle the following case or not:
>>>
>>> process A                                 process B                          process C
>>>
>>> disk_map_sector_rcu():                    delete_partition():               disk_map_sector_rcu():
>>>
>>> rcu_read_lock
>>>
>>>   // need to iterate partition table
>>>   part[i] != NULL   (1)                   part[i] = NULL (2)
>>>                                           smp_mb()
>>>                                           last_lookup = NULL (3)
>>>                                           call_rcu()  (4)
>>>     last_lookup = part[i] (5)
>>>
>>>
>>>                                                                              rcu_read_lock()
>>>                                                                              read last_lookup return part[i] (6)
>>>                                                                              sector_in_part() is OK (7)
>>>                                                                              return part[i] (8)
>>
>> Just for the record...
>>
>> Use of RCU needs to ensure that readers cannot access the to-be-freed
>> structure -before- invoking call_rcu().  Which does look to happen here
>> with the "last_lookup = NULL".  But in addition, the callback needs to
>> get access to the to-be-freed structure via some sideband (usually the
>> structure passed to call_rcu()), not from the reader-accessible structure.
>>
>> Or am I misinterpreting this sequence of events?
> 
> If I understand correctly, the issue described above is there are 2 threads
> setting last_lookup pointer simultaneously, one of them is NULLing it and
> waiting for a GP before freeing it (process B above), while the other is
> assigning to it concurrently after it was just NULLed (process A). Meanwhile
> process C starts a reader section *after* the GP by process B already started
> and accesses the reassigned pointer causing use-after-free.
> 
> Did I miss something?
> 
No. It's exactly the same as you have summarized. And thanks for that.

> I believe the fix is what Tao already posted which is to use refcounts so
> that the destructor does not free it while references are already held. Is
> that what the final fix is going to be? That other thread is pretty long so I
> lost track a bit..
> We are just trying to find a better solution (e.g. more readable or understandable).

Regards,
Tao

> thanks,
> 
>  - Joel
> 
> 
> 
>> 							Thanx, Paul
>>
>>>   part[i] == NULL (9)
>>>       last_lookup = NULL (10)
>>>   rcu_read_unlock() (11)
>>>                                            one RCU grace period completes
>>>                                            __delete_partition() (12)
>>>                                            free hd_partition (13)
>>>                                                                              // use-after-free
>>>                                                                              hd_struct_try_get(part[i])  (14)
>>>
>>> * the number in the parenthesis is the sequence of events.
>>>
>>> Maybe RCU experts can shed some light on this problem, so cc +paulmck@kernel.org, +joel@joelfernandes.org and +RCU maillist.
>>>
>>> If the above case is possible, maybe we can fix the problem by pinning last_lookup through increasing its ref-count
>>> (the following patch is only compile tested):
>>>
>>> diff --git a/block/genhd.c b/block/genhd.c
>>> index 6e8543ca6912..179e0056fae1 100644
>>> --- a/block/genhd.c
>>> +++ b/block/genhd.c
>>> @@ -279,7 +279,14 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>>>  		part = rcu_dereference(ptbl->part[i]);
>>>
>>>  		if (part && sector_in_part(part, sector)) {
>>> -			rcu_assign_pointer(ptbl->last_lookup, part);
>>> +			struct hd_struct *old;
>>> +
>>> +			if (!hd_struct_try_get(part))
>>> +				break;
>>> +
>>> +			old = xchg(&ptbl->last_lookup, part);
>>> +			if (old)
>>> +				hd_struct_put(old);
>>>  			return part;
>>>  		}
>>>  	}
>>> @@ -1231,7 +1238,11 @@ static void disk_replace_part_tbl(struct gendisk *disk,
>>>  	rcu_assign_pointer(disk->part_tbl, new_ptbl);
>>>
>>>  	if (old_ptbl) {
>>> -		rcu_assign_pointer(old_ptbl->last_lookup, NULL);
>>> +		struct hd_struct *part;
>>> +
>>> +		part = xchg(&old_ptbl->last_lookup, NULL);
>>> +		if (part)
>>> +			hd_struct_put(part);
>>>  		kfree_rcu(old_ptbl, rcu_head);
>>>  	}
>>>  }
>>> diff --git a/block/partition-generic.c b/block/partition-generic.c
>>> index 98d60a59b843..441c1c591c04 100644
>>> --- a/block/partition-generic.c
>>> +++ b/block/partition-generic.c
>>> @@ -285,7 +285,8 @@ void delete_partition(struct gendisk *disk, int partno)
>>>  		return;
>>>
>>>  	rcu_assign_pointer(ptbl->part[partno], NULL);
>>> -	rcu_assign_pointer(ptbl->last_lookup, NULL);
>>> +	if (cmpxchg(&ptbl->last_lookup, part, NULL) == part)
>>> +		hd_struct_put(part);
>>>  	kobject_put(part->holder_dir);
>>>  	device_del(part_to_dev(part));
>>>
>>> -- 
>>> 2.22.0
>>>
>>> Regards,
>>> Tao
>>>
>>>
>>>> diff --git a/block/partition-generic.c b/block/partition-generic.c
>>>> index 1d20c9cf213f..1e0065ed6f02 100644
>>>> --- a/block/partition-generic.c
>>>> +++ b/block/partition-generic.c
>>>> @@ -284,6 +284,13 @@ void delete_partition(struct gendisk *disk, int partno)
>>>>  		return;
>>>>  
>>>>  	rcu_assign_pointer(ptbl->part[partno], NULL);
>>>> +	/*
>>>> +	 * Without the memory barrier, disk_map_sector_rcu()
>>>> +	 * may read the old value after overwriting the
>>>> +	 * last_lookup. Then it can not clear last_lookup,
>>>> +	 * which may cause use-after-free.
>>>> +	 */
>>>> +	smp_mb();
>>>>  	rcu_assign_pointer(ptbl->last_lookup, NULL);
>>>>  	kobject_put(part->holder_dir);
>>>>  	device_del(part_to_dev(part));
>>>>
>>>
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-03 15:16               ` Ming Lei
@ 2020-01-06  7:39                 ` Yufen Yu
  2020-01-06  8:11                   ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Yufen Yu @ 2020-01-06  7:39 UTC (permalink / raw)
  To: Ming Lei
  Cc: Hou Tao, axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck,
	joel, rcu

Hi, Ming

On 2020/1/3 23:16, Ming Lei wrote:
> Hello Yufen,
> 
> OK, we still can move clearing .last_lookup into __delete_partition(),
> at that time all IO path can observe the partition percpu-refcount killed.
> 
> Also the rcu work fn is run after one RCU grace period, at that time,
> the NULL .last_lookup becomes visible in all IO path too.
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 089e890ab208..79599f5fd5b7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1365,18 +1365,6 @@ void blk_account_io_start(struct request *rq, bool new_io)
>   		part_stat_inc(part, merges[rw]);
>   	} else {
>   		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> -		if (!hd_struct_try_get(part)) {
> -			/*
> -			 * The partition is already being removed,
> -			 * the request will be accounted on the disk only
> -			 *
> -			 * We take a reference on disk->part0 although that
> -			 * partition will never be deleted, so we can treat
> -			 * it as any other partition.
> -			 */
> -			part = &rq->rq_disk->part0;
> -			hd_struct_get(part);
> -		}
>   		part_inc_in_flight(rq->q, part, rw);
>   		rq->part = part;
>   	}
> diff --git a/block/genhd.c b/block/genhd.c
> index ff6268970ddc..e3dec90b1f43 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -286,17 +286,21 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>   	ptbl = rcu_dereference(disk->part_tbl);
>   
>   	part = rcu_dereference(ptbl->last_lookup);
> -	if (part && sector_in_part(part, sector))
> +	if (part && sector_in_part(part, sector) && hd_struct_try_get(part))
>   		return part;
>   
>   	for (i = 1; i < ptbl->len; i++) {
>   		part = rcu_dereference(ptbl->part[i]);
>   
>   		if (part && sector_in_part(part, sector)) {
> +                       if (!hd_struct_try_get(part))
> +                               goto exit;
>   			rcu_assign_pointer(ptbl->last_lookup, part);
>   			return part;
>   		}
>   	}
> + exit:
> +	hd_struct_get(&disk->part0);
>   	return &disk->part0;
>   }
>   EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
> diff --git a/block/partition-generic.c b/block/partition-generic.c
> index 1d20c9cf213f..1739f750dbf2 100644
> --- a/block/partition-generic.c
> +++ b/block/partition-generic.c
> @@ -262,6 +262,12 @@ static void delete_partition_work_fn(struct work_struct *work)
>   void __delete_partition(struct percpu_ref *ref)
>   {
>   	struct hd_struct *part = container_of(ref, struct hd_struct, ref);
> +	struct disk_part_tbl *ptbl =
> +		rcu_dereference_protected(part->disk->part_tbl, 1);
> +
> +	rcu_assign_pointer(ptbl->last_lookup, NULL);
> +	put_device(disk_to_dev(part->disk));
> +
>   	INIT_RCU_WORK(&part->rcu_work, delete_partition_work_fn);
>   	queue_rcu_work(system_wq, &part->rcu_work);
>   }
> @@ -283,8 +289,9 @@ void delete_partition(struct gendisk *disk, int partno)
>   	if (!part)
>   		return;
>   
> +	get_device(disk_to_dev(disk));
>   	rcu_assign_pointer(ptbl->part[partno], NULL);
> -	rcu_assign_pointer(ptbl->last_lookup, NULL);
> +
>   	kobject_put(part->holder_dir);
>   	device_del(part_to_dev(part));
>   
> @@ -349,6 +356,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
>   	p->nr_sects = len;
>   	p->partno = partno;
>   	p->policy = get_disk_ro(disk);
> +	p->disk = disk;
>   
>   	if (info) {
>   		struct partition_meta_info *pinfo = alloc_part_info(disk);
> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> index 8bb63027e4d6..66660ec5e8ee 100644
> --- a/include/linux/genhd.h
> +++ b/include/linux/genhd.h
> @@ -129,6 +129,7 @@ struct hd_struct {
>   #else
>   	struct disk_stats dkstats;
>   #endif
> +	struct gendisk *disk;
>   	struct percpu_ref ref;
>   	struct rcu_work rcu_work;
>   };


IMO, this change can solve the problem. But, __delete_partition will
depend on the implementation of disk_release(). If disk .release modify
as blocked in the future, then __delete_partition will also be blocked,
which is not expected in rcu callback function.

We may cache index of part[] instead of part[i] itself to fix the use-after-free bug.
https://patchwork.kernel.org/patch/11318767/

Thanks,
Yufen

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-06  7:39                 ` Yufen Yu
@ 2020-01-06  8:11                   ` Ming Lei
  2020-01-06  9:41                     ` Hou Tao
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2020-01-06  8:11 UTC (permalink / raw)
  To: Yufen Yu
  Cc: Hou Tao, axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck,
	joel, rcu

On Mon, Jan 06, 2020 at 03:39:07PM +0800, Yufen Yu wrote:
> Hi, Ming
> 
> On 2020/1/3 23:16, Ming Lei wrote:
> > Hello Yufen,
> > 
> > OK, we still can move clearing .last_lookup into __delete_partition(),
> > at that time all IO path can observe the partition percpu-refcount killed.
> > 
> > Also the rcu work fn is run after one RCU grace period, at that time,
> > the NULL .last_lookup becomes visible in all IO path too.
> > 
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 089e890ab208..79599f5fd5b7 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1365,18 +1365,6 @@ void blk_account_io_start(struct request *rq, bool new_io)
> >   		part_stat_inc(part, merges[rw]);
> >   	} else {
> >   		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> > -		if (!hd_struct_try_get(part)) {
> > -			/*
> > -			 * The partition is already being removed,
> > -			 * the request will be accounted on the disk only
> > -			 *
> > -			 * We take a reference on disk->part0 although that
> > -			 * partition will never be deleted, so we can treat
> > -			 * it as any other partition.
> > -			 */
> > -			part = &rq->rq_disk->part0;
> > -			hd_struct_get(part);
> > -		}
> >   		part_inc_in_flight(rq->q, part, rw);
> >   		rq->part = part;
> >   	}
> > diff --git a/block/genhd.c b/block/genhd.c
> > index ff6268970ddc..e3dec90b1f43 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -286,17 +286,21 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> >   	ptbl = rcu_dereference(disk->part_tbl);
> >   	part = rcu_dereference(ptbl->last_lookup);
> > -	if (part && sector_in_part(part, sector))
> > +	if (part && sector_in_part(part, sector) && hd_struct_try_get(part))
> >   		return part;
> >   	for (i = 1; i < ptbl->len; i++) {
> >   		part = rcu_dereference(ptbl->part[i]);
> >   		if (part && sector_in_part(part, sector)) {
> > +                       if (!hd_struct_try_get(part))
> > +                               goto exit;
> >   			rcu_assign_pointer(ptbl->last_lookup, part);
> >   			return part;
> >   		}
> >   	}
> > + exit:
> > +	hd_struct_get(&disk->part0);
> >   	return &disk->part0;
> >   }
> >   EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
> > diff --git a/block/partition-generic.c b/block/partition-generic.c
> > index 1d20c9cf213f..1739f750dbf2 100644
> > --- a/block/partition-generic.c
> > +++ b/block/partition-generic.c
> > @@ -262,6 +262,12 @@ static void delete_partition_work_fn(struct work_struct *work)
> >   void __delete_partition(struct percpu_ref *ref)
> >   {
> >   	struct hd_struct *part = container_of(ref, struct hd_struct, ref);
> > +	struct disk_part_tbl *ptbl =
> > +		rcu_dereference_protected(part->disk->part_tbl, 1);
> > +
> > +	rcu_assign_pointer(ptbl->last_lookup, NULL);
> > +	put_device(disk_to_dev(part->disk));
> > +
> >   	INIT_RCU_WORK(&part->rcu_work, delete_partition_work_fn);
> >   	queue_rcu_work(system_wq, &part->rcu_work);
> >   }
> > @@ -283,8 +289,9 @@ void delete_partition(struct gendisk *disk, int partno)
> >   	if (!part)
> >   		return;
> > +	get_device(disk_to_dev(disk));
> >   	rcu_assign_pointer(ptbl->part[partno], NULL);
> > -	rcu_assign_pointer(ptbl->last_lookup, NULL);
> > +
> >   	kobject_put(part->holder_dir);
> >   	device_del(part_to_dev(part));
> > @@ -349,6 +356,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
> >   	p->nr_sects = len;
> >   	p->partno = partno;
> >   	p->policy = get_disk_ro(disk);
> > +	p->disk = disk;
> >   	if (info) {
> >   		struct partition_meta_info *pinfo = alloc_part_info(disk);
> > diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> > index 8bb63027e4d6..66660ec5e8ee 100644
> > --- a/include/linux/genhd.h
> > +++ b/include/linux/genhd.h
> > @@ -129,6 +129,7 @@ struct hd_struct {
> >   #else
> >   	struct disk_stats dkstats;
> >   #endif
> > +	struct gendisk *disk;
> >   	struct percpu_ref ref;
> >   	struct rcu_work rcu_work;
> >   };
> 
> 
> IMO, this change can solve the problem. But, __delete_partition will
> depend on the implementation of disk_release(). If disk .release modify
> as blocked in the future, then __delete_partition will also be blocked,
> which is not expected in rcu callback function.

__delete_partition() won't be blocked because it just calls queue_rcu_work() to
release the partition instance in wq context.

> 
> We may cache index of part[] instead of part[i] itself to fix the use-after-free bug.
> https://patchwork.kernel.org/patch/11318767/

That approach can fix the issue too, but extra overhead is added in the
fast path because partition retrieval is changed to the following way:

	+       last_lookup = READ_ONCE(ptbl->last_lookup);
	+       if (last_lookup > 0 && last_lookup < ptbl->len) {
	+               part = rcu_dereference(ptbl->part[last_lookup]);
	+               if (part && sector_in_part(part, sector))
	+                       return part;
	+       }

from 
	part = rcu_dereference(ptbl->last_lookup);

So ptbl->part[] has to be fetched, it is fine if the ->part[] array
shares same cacheline with ptbl->last_lookup, but one disk may have
too many partitions, then your approach may introduce one extra cache
miss every time.

READ_ONCE() may imply one read barrier too.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-06  8:11                   ` Ming Lei
@ 2020-01-06  9:41                     ` Hou Tao
  2020-01-06 10:05                       ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Hou Tao @ 2020-01-06  9:41 UTC (permalink / raw)
  To: Ming Lei, Yufen Yu
  Cc: axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck, joel, rcu

Hi,

On 2020/1/6 16:11, Ming Lei wrote:
> On Mon, Jan 06, 2020 at 03:39:07PM +0800, Yufen Yu wrote:
>> Hi, Ming
>>
>> On 2020/1/3 23:16, Ming Lei wrote:
>>> Hello Yufen,
>>>
>>> OK, we still can move clearing .last_lookup into __delete_partition(),
>>> at that time all IO path can observe the partition percpu-refcount killed.
>>>
>>> Also the rcu work fn is run after one RCU grace period, at that time,
>>> the NULL .last_lookup becomes visible in all IO path too.
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 089e890ab208..79599f5fd5b7 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -1365,18 +1365,6 @@ void blk_account_io_start(struct request *rq, bool new_io)
>>>   		part_stat_inc(part, merges[rw]);
>>>   	} else {
>>>   		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
>>> -		if (!hd_struct_try_get(part)) {
>>> -			/*
>>> -			 * The partition is already being removed,
>>> -			 * the request will be accounted on the disk only
>>> -			 *
>>> -			 * We take a reference on disk->part0 although that
>>> -			 * partition will never be deleted, so we can treat
>>> -			 * it as any other partition.
>>> -			 */
>>> -			part = &rq->rq_disk->part0;
>>> -			hd_struct_get(part);
>>> -		}
>>>   		part_inc_in_flight(rq->q, part, rw);
>>>   		rq->part = part;
>>>   	}
>>> diff --git a/block/genhd.c b/block/genhd.c
>>> index ff6268970ddc..e3dec90b1f43 100644
>>> --- a/block/genhd.c
>>> +++ b/block/genhd.c
>>> @@ -286,17 +286,21 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
>>>   	ptbl = rcu_dereference(disk->part_tbl);
>>>   	part = rcu_dereference(ptbl->last_lookup);
>>> -	if (part && sector_in_part(part, sector))
>>> +	if (part && sector_in_part(part, sector) && hd_struct_try_get(part))
>>>   		return part;
>>>   	for (i = 1; i < ptbl->len; i++) {
>>>   		part = rcu_dereference(ptbl->part[i]);
>>>   		if (part && sector_in_part(part, sector)) {
>>> +                       if (!hd_struct_try_get(part))
>>> +                               goto exit;
>>>   			rcu_assign_pointer(ptbl->last_lookup, part);
>>>   			return part;
>>>   		}
>>>   	}
>>> + exit:
>>> +	hd_struct_get(&disk->part0);
>>>   	return &disk->part0;
>>>   }
>>>   EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
>>> diff --git a/block/partition-generic.c b/block/partition-generic.c
>>> index 1d20c9cf213f..1739f750dbf2 100644
>>> --- a/block/partition-generic.c
>>> +++ b/block/partition-generic.c
>>> @@ -262,6 +262,12 @@ static void delete_partition_work_fn(struct work_struct *work)
>>>   void __delete_partition(struct percpu_ref *ref)
>>>   {
>>>   	struct hd_struct *part = container_of(ref, struct hd_struct, ref);
>>> +	struct disk_part_tbl *ptbl =
>>> +		rcu_dereference_protected(part->disk->part_tbl, 1);
>>> +
>>> +	rcu_assign_pointer(ptbl->last_lookup, NULL);
>>> +	put_device(disk_to_dev(part->disk));
>>> +
>>>   	INIT_RCU_WORK(&part->rcu_work, delete_partition_work_fn);
>>>   	queue_rcu_work(system_wq, &part->rcu_work);
>>>   }
>>> @@ -283,8 +289,9 @@ void delete_partition(struct gendisk *disk, int partno)
>>>   	if (!part)
>>>   		return;
>>> +	get_device(disk_to_dev(disk));
>>>   	rcu_assign_pointer(ptbl->part[partno], NULL);
>>> -	rcu_assign_pointer(ptbl->last_lookup, NULL);
>>> +
>>>   	kobject_put(part->holder_dir);
>>>   	device_del(part_to_dev(part));
>>> @@ -349,6 +356,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
>>>   	p->nr_sects = len;
>>>   	p->partno = partno;
>>>   	p->policy = get_disk_ro(disk);
>>> +	p->disk = disk;
>>>   	if (info) {
>>>   		struct partition_meta_info *pinfo = alloc_part_info(disk);
>>> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
>>> index 8bb63027e4d6..66660ec5e8ee 100644
>>> --- a/include/linux/genhd.h
>>> +++ b/include/linux/genhd.h
>>> @@ -129,6 +129,7 @@ struct hd_struct {
>>>   #else
>>>   	struct disk_stats dkstats;
>>>   #endif
>>> +	struct gendisk *disk;
>>>   	struct percpu_ref ref;
>>>   	struct rcu_work rcu_work;
>>>   };
>>
>>
>> IMO, this change can solve the problem. But, __delete_partition will
>> depend on the implementation of disk_release(). If disk .release modify
>> as blocked in the future, then __delete_partition will also be blocked,
>> which is not expected in rcu callback function.
> 
> __delete_partition() won't be blocked because it just calls queue_rcu_work() to
> release the partition instance in wq context.
> 
>>
>> We may cache index of part[] instead of part[i] itself to fix the use-after-free bug.
>> https://patchwork.kernel.org/patch/11318767/
> 
> That approach can fix the issue too, but extra overhead is added in the
> fast path because partition retrieval is changed to the following way:
> 
> 	+       last_lookup = READ_ONCE(ptbl->last_lookup);
> 	+       if (last_lookup > 0 && last_lookup < ptbl->len) {
> 	+               part = rcu_dereference(ptbl->part[last_lookup]);
> 	+               if (part && sector_in_part(part, sector))
> 	+                       return part;
> 	+       }
> 
> from 
> 	part = rcu_dereference(ptbl->last_lookup);
> 
> So ptbl->part[] has to be fetched, it is fine if the ->part[] array
> shares same cacheline with ptbl->last_lookup, but one disk may have
> too many partitions, then your approach may introduce one extra cache
> miss every time.
> 
Yes. The solution you proposed also adds an invocation of percpu_ref_tryget_live()
in the fast path. Not sure which one will have a better performance. However the
reason we prefer the index caching is the simplicity instead of performance.

> READ_ONCE() may imply one read barrier too.
> 
According to the manual, it only matters on Alpha machines and READ_ONCE() also
exists in rcu_dereference().

Regards,
Tao

> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-06  9:41                     ` Hou Tao
@ 2020-01-06 10:05                       ` Ming Lei
  2020-01-07 11:40                         ` Hou Tao
  0 siblings, 1 reply; 20+ messages in thread
From: Ming Lei @ 2020-01-06 10:05 UTC (permalink / raw)
  To: Hou Tao
  Cc: Yufen Yu, axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck,
	joel, rcu

On Mon, Jan 06, 2020 at 05:41:45PM +0800, Hou Tao wrote:
> Hi,
> 
> On 2020/1/6 16:11, Ming Lei wrote:
> > On Mon, Jan 06, 2020 at 03:39:07PM +0800, Yufen Yu wrote:
> >> Hi, Ming
> >>
> >> On 2020/1/3 23:16, Ming Lei wrote:
> >>> Hello Yufen,
> >>>
> >>> OK, we still can move clearing .last_lookup into __delete_partition(),
> >>> at that time all IO path can observe the partition percpu-refcount killed.
> >>>
> >>> Also the rcu work fn is run after one RCU grace period, at that time,
> >>> the NULL .last_lookup becomes visible in all IO path too.
> >>>
> >>> diff --git a/block/blk-core.c b/block/blk-core.c
> >>> index 089e890ab208..79599f5fd5b7 100644
> >>> --- a/block/blk-core.c
> >>> +++ b/block/blk-core.c
> >>> @@ -1365,18 +1365,6 @@ void blk_account_io_start(struct request *rq, bool new_io)
> >>>   		part_stat_inc(part, merges[rw]);
> >>>   	} else {
> >>>   		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
> >>> -		if (!hd_struct_try_get(part)) {
> >>> -			/*
> >>> -			 * The partition is already being removed,
> >>> -			 * the request will be accounted on the disk only
> >>> -			 *
> >>> -			 * We take a reference on disk->part0 although that
> >>> -			 * partition will never be deleted, so we can treat
> >>> -			 * it as any other partition.
> >>> -			 */
> >>> -			part = &rq->rq_disk->part0;
> >>> -			hd_struct_get(part);
> >>> -		}
> >>>   		part_inc_in_flight(rq->q, part, rw);
> >>>   		rq->part = part;
> >>>   	}
> >>> diff --git a/block/genhd.c b/block/genhd.c
> >>> index ff6268970ddc..e3dec90b1f43 100644
> >>> --- a/block/genhd.c
> >>> +++ b/block/genhd.c
> >>> @@ -286,17 +286,21 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
> >>>   	ptbl = rcu_dereference(disk->part_tbl);
> >>>   	part = rcu_dereference(ptbl->last_lookup);
> >>> -	if (part && sector_in_part(part, sector))
> >>> +	if (part && sector_in_part(part, sector) && hd_struct_try_get(part))
> >>>   		return part;
> >>>   	for (i = 1; i < ptbl->len; i++) {
> >>>   		part = rcu_dereference(ptbl->part[i]);
> >>>   		if (part && sector_in_part(part, sector)) {
> >>> +                       if (!hd_struct_try_get(part))
> >>> +                               goto exit;
> >>>   			rcu_assign_pointer(ptbl->last_lookup, part);
> >>>   			return part;
> >>>   		}
> >>>   	}
> >>> + exit:
> >>> +	hd_struct_get(&disk->part0);
> >>>   	return &disk->part0;
> >>>   }
> >>>   EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
> >>> diff --git a/block/partition-generic.c b/block/partition-generic.c
> >>> index 1d20c9cf213f..1739f750dbf2 100644
> >>> --- a/block/partition-generic.c
> >>> +++ b/block/partition-generic.c
> >>> @@ -262,6 +262,12 @@ static void delete_partition_work_fn(struct work_struct *work)
> >>>   void __delete_partition(struct percpu_ref *ref)
> >>>   {
> >>>   	struct hd_struct *part = container_of(ref, struct hd_struct, ref);
> >>> +	struct disk_part_tbl *ptbl =
> >>> +		rcu_dereference_protected(part->disk->part_tbl, 1);
> >>> +
> >>> +	rcu_assign_pointer(ptbl->last_lookup, NULL);
> >>> +	put_device(disk_to_dev(part->disk));
> >>> +
> >>>   	INIT_RCU_WORK(&part->rcu_work, delete_partition_work_fn);
> >>>   	queue_rcu_work(system_wq, &part->rcu_work);
> >>>   }
> >>> @@ -283,8 +289,9 @@ void delete_partition(struct gendisk *disk, int partno)
> >>>   	if (!part)
> >>>   		return;
> >>> +	get_device(disk_to_dev(disk));
> >>>   	rcu_assign_pointer(ptbl->part[partno], NULL);
> >>> -	rcu_assign_pointer(ptbl->last_lookup, NULL);
> >>> +
> >>>   	kobject_put(part->holder_dir);
> >>>   	device_del(part_to_dev(part));
> >>> @@ -349,6 +356,7 @@ struct hd_struct *add_partition(struct gendisk *disk, int partno,
> >>>   	p->nr_sects = len;
> >>>   	p->partno = partno;
> >>>   	p->policy = get_disk_ro(disk);
> >>> +	p->disk = disk;
> >>>   	if (info) {
> >>>   		struct partition_meta_info *pinfo = alloc_part_info(disk);
> >>> diff --git a/include/linux/genhd.h b/include/linux/genhd.h
> >>> index 8bb63027e4d6..66660ec5e8ee 100644
> >>> --- a/include/linux/genhd.h
> >>> +++ b/include/linux/genhd.h
> >>> @@ -129,6 +129,7 @@ struct hd_struct {
> >>>   #else
> >>>   	struct disk_stats dkstats;
> >>>   #endif
> >>> +	struct gendisk *disk;
> >>>   	struct percpu_ref ref;
> >>>   	struct rcu_work rcu_work;
> >>>   };
> >>
> >>
> >> IMO, this change can solve the problem. But, __delete_partition will
> >> depend on the implementation of disk_release(). If disk .release modify
> >> as blocked in the future, then __delete_partition will also be blocked,
> >> which is not expected in rcu callback function.
> > 
> > __delete_partition() won't be blocked because it just calls queue_rcu_work() to
> > release the partition instance in wq context.
> > 
> >>
> >> We may cache index of part[] instead of part[i] itself to fix the use-after-free bug.
> >> https://patchwork.kernel.org/patch/11318767/
> > 
> > That approach can fix the issue too, but extra overhead is added in the
> > fast path because partition retrieval is changed to the following way:
> > 
> > 	+       last_lookup = READ_ONCE(ptbl->last_lookup);
> > 	+       if (last_lookup > 0 && last_lookup < ptbl->len) {
> > 	+               part = rcu_dereference(ptbl->part[last_lookup]);
> > 	+               if (part && sector_in_part(part, sector))
> > 	+                       return part;
> > 	+       }
> > 
> > from 
> > 	part = rcu_dereference(ptbl->last_lookup);
> > 
> > So ptbl->part[] has to be fetched, it is fine if the ->part[] array
> > shares same cacheline with ptbl->last_lookup, but one disk may have
> > too many partitions, then your approach may introduce one extra cache
> > miss every time.
> > 
> Yes. The solution you proposed also adds an invocation of percpu_ref_tryget_live()
> in the fast path. Not sure which one will have a better performance. However the
> reason we prefer the index caching is the simplicity instead of performance.

No, hd_struct_try_get() and hd_struct_get() is always called once for one IO, the
patch I proposed changes nothing about this usage.

Please take a close look at the patch:

https://lore.kernel.org/linux-block/5cc465cc-d68c-088e-0729-2695279c7853@huawei.com/T/#m8f3e6b4e77eadf006ce142a84c966f50f3a9ae26

which just moves hd_struct_try_get() from blk_account_io_start() into
disk_map_sector_rcu(), doesn't it?


Thanks,
Ming


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-06 10:05                       ` Ming Lei
@ 2020-01-07 11:40                         ` Hou Tao
  2020-01-08  3:19                           ` Ming Lei
  0 siblings, 1 reply; 20+ messages in thread
From: Hou Tao @ 2020-01-07 11:40 UTC (permalink / raw)
  To: Ming Lei
  Cc: Yufen Yu, axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck,
	joel, rcu

Hi,

On 2020/1/6 18:05, Ming Lei wrote:
> On Mon, Jan 06, 2020 at 05:41:45PM +0800, Hou Tao wrote:
>> Hi,
>>
[snipped]

>> Yes. The solution you proposed also adds an invocation of percpu_ref_tryget_live()
>> in the fast path. Not sure which one will have a better performance. However the
>> reason we prefer the index caching is the simplicity instead of performance.
> 
> No, hd_struct_try_get() and hd_struct_get() is always called once for one IO, the
> patch I proposed changes nothing about this usage.
> 
> Please take a close look at the patch:
> 
> https://lore.kernel.org/linux-block/5cc465cc-d68c-088e-0729-2695279c7853@huawei.com/T/#m8f3e6b4e77eadf006ce142a84c966f50f3a9ae26
> 
> which just moves hd_struct_try_get() from blk_account_io_start() into
> disk_map_sector_rcu(), doesn't it?
> 
Yes, you are right. And a little suggestion for your patch:

@@ -283,8 +289,9 @@ void delete_partition(struct gendisk *disk, int partno)
 	if (!part)
 		return;

+	get_device(disk_to_dev(disk));
 	rcu_assign_pointer(ptbl->part[partno], NULL);
-	rcu_assign_pointer(ptbl->last_lookup, NULL);
+
 	kobject_put(part->holder_dir);
 	device_del(part_to_dev(part));

Could we move the call of get_device() into add_partition, and that will make the assignment of
disk to p->disk in add_partition() be natural ?

Maybe there is no need to add a new disk field in hd_struct, because the kobject of gendisk
is already the parent of hd_struct. But make use of part->__dev.parent after the calling
of device_del() is a bad_idea.

Regards,
Tao


> 
> Thanks,
> Ming
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] block: make sure last_lookup set as NULL after part deleted
  2020-01-07 11:40                         ` Hou Tao
@ 2020-01-08  3:19                           ` Ming Lei
  0 siblings, 0 replies; 20+ messages in thread
From: Ming Lei @ 2020-01-08  3:19 UTC (permalink / raw)
  To: Hou Tao
  Cc: Yufen Yu, axboe, linux-block, hch, zhengchuan, yi.zhang, paulmck,
	joel, rcu

On Tue, Jan 07, 2020 at 07:40:10PM +0800, Hou Tao wrote:
> Hi,
> 
> On 2020/1/6 18:05, Ming Lei wrote:
> > On Mon, Jan 06, 2020 at 05:41:45PM +0800, Hou Tao wrote:
> >> Hi,
> >>
> [snipped]
> 
> >> Yes. The solution you proposed also adds an invocation of percpu_ref_tryget_live()
> >> in the fast path. Not sure which one will have a better performance. However the
> >> reason we prefer the index caching is the simplicity instead of performance.
> > 
> > No, hd_struct_try_get() and hd_struct_get() is always called once for one IO, the
> > patch I proposed changes nothing about this usage.
> > 
> > Please take a close look at the patch:
> > 
> > https://lore.kernel.org/linux-block/5cc465cc-d68c-088e-0729-2695279c7853@huawei.com/T/#m8f3e6b4e77eadf006ce142a84c966f50f3a9ae26
> > 
> > which just moves hd_struct_try_get() from blk_account_io_start() into
> > disk_map_sector_rcu(), doesn't it?
> > 
> Yes, you are right. And a little suggestion for your patch:
> 
> @@ -283,8 +289,9 @@ void delete_partition(struct gendisk *disk, int partno)
>  	if (!part)
>  		return;
> 
> +	get_device(disk_to_dev(disk));
>  	rcu_assign_pointer(ptbl->part[partno], NULL);
> -	rcu_assign_pointer(ptbl->last_lookup, NULL);
> +
>  	kobject_put(part->holder_dir);
>  	device_del(part_to_dev(part));
> 
> Could we move the call of get_device() into add_partition, and that will make the assignment of
> disk to p->disk in add_partition() be natural ?

It isn't necessary to do that way, the parent disk's refcount isn't released
until device_del(part_to_dev(part)). So it is enough to hold disk's refcount
before calling device_del(part).

> 
> Maybe there is no need to add a new disk field in hd_struct, because the kobject of gendisk
> is already the parent of hd_struct. But make use of part->__dev.parent after the calling
> of device_del() is a bad_idea.

Yeah, that is why I added 'disk' field, which can be put with other
fields not accessed in fast path.

BTW, 'struct percpu_ref ref' should have been put just after 'nr_sects',
then these fast path fields can share single cache line.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2020-01-08  3:20 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20191231110945.10857-1-yuyufen@huawei.com>
2019-12-31 14:55 ` [PATCH] block: make sure last_lookup set as NULL after part deleted Hou Tao
2019-12-31 23:11   ` Paul E. McKenney
2020-01-01  2:33     ` htbegin
2020-01-01  3:39       ` htbegin
2020-01-03 23:45     ` Joel Fernandes
2020-01-04  9:16       ` Hou Tao
2020-01-02  1:23   ` Ming Lei
2020-01-03  3:06     ` Hou Tao
2020-01-03  4:18       ` Ming Lei
2020-01-03  7:35         ` Hou Tao
2020-01-03  8:17           ` Ming Lei
2020-01-03 12:03             ` Yufen Yu
2020-01-03 15:16               ` Ming Lei
2020-01-06  7:39                 ` Yufen Yu
2020-01-06  8:11                   ` Ming Lei
2020-01-06  9:41                     ` Hou Tao
2020-01-06 10:05                       ` Ming Lei
2020-01-07 11:40                         ` Hou Tao
2020-01-08  3:19                           ` Ming Lei
2020-01-03 12:43   ` Yufen Yu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).