linux-kernel.vger.kernel.org archive mirror
* 4.4-rc: 28 bioset threads on small notebook
@ 2015-12-11 10:49 Pavel Machek
  2015-12-11 14:08 ` Mike Snitzer
  0 siblings, 1 reply; 33+ messages in thread
From: Pavel Machek @ 2015-12-11 10:49 UTC (permalink / raw)
  To: kernel list; +Cc: axboe, hch, snitzer

Hi!

I know it is normal to spawn 8 threads for every single function,

root      1069  0.0  0.0      0     0 ?        S    Dec08   0:00
[scsi_eh_0]
root      1070  0.0  0.0      0     0 ?        S<   Dec08   0:00
[scsi_tmf_0]
root      1073  0.0  0.0      0     0 ?        S    Dec08   0:00
[scsi_eh_1]
root      1074  0.0  0.0      0     0 ?        S<   Dec08   0:00
[scsi_tmf_1]
root      1077  0.0  0.0      0     0 ?        S    Dec08   0:00
[scsi_eh_2]
root      1078  0.0  0.0      0     0 ?        S<   Dec08   0:00
[scsi_tmf_2]
root      1081  0.0  0.0      0     0 ?        S    Dec08   0:00
[scsi_eh_3]
root      1082  0.0  0.0      0     0 ?        S<   Dec08   0:00
[scsi_tmf_3]
root      1085  0.0  0.0      0     0 ?        S    Dec08   0:00
[scsi_eh_4]
root      1087  0.0  0.0      0     0 ?        S<   Dec08   0:00
[scsi_tmf_4]
root      1101  0.0  0.0      0     0 ?        S    Dec08   0:00
[scsi_eh_5]
root      1102  0.0  0.0      0     0 ?        S<   Dec08   0:00
[scsi_tmf_5]
root      2741  0.0  0.0      0     0 ?        S    Dec08   0:00 [nfsd]
root      2742  0.0  0.0      0     0 ?        S    Dec08   0:00 [nfsd]
root      2743  0.0  0.0      0     0 ?        S    Dec08   0:00 [nfsd]
root      2744  0.0  0.0      0     0 ?        S    Dec08   0:00 [nfsd]
root      2745  0.0  0.0      0     0 ?        S    Dec08   0:00 [nfsd]
root      2746  0.0  0.0      0     0 ?        S    Dec08   0:00 [nfsd]
root      2747  0.0  0.0      0     0 ?        S    Dec08   0:00 [nfsd]
root      2748  0.0  0.0      0     0 ?        S    Dec08   0:00 [nfsd]


but 28 threads?

root       974  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root       977  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root       980  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root       983  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root       986  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root       989  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root       992  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root       995  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1000  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1002  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1003  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1004  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1005  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1006  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1007  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1008  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1009  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1010  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1011  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1012  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1013  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1014  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1015  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
root      1016  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]

?

I'm not even using btrfs.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-rc: 28 bioset threads on small notebook
  2015-12-11 10:49 4.4-rc: 28 bioset threads on small notebook Pavel Machek
@ 2015-12-11 14:08 ` Mike Snitzer
  2015-12-11 17:14   ` Pavel Machek
  2016-02-20 17:40   ` 4.4-final: " Pavel Machek
  0 siblings, 2 replies; 33+ messages in thread
From: Mike Snitzer @ 2015-12-11 14:08 UTC (permalink / raw)
  To: Pavel Machek; +Cc: kernel list, axboe, hch

On Fri, Dec 11 2015 at  5:49am -0500,
Pavel Machek <pavel@ucw.cz> wrote:

> Hi!
> 
> I know it is normal to spawn 8 threads for every single function,
...
> but 28 threads?
> 
> root       974  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
...

How many physical block devices do you have?

DM is doing its part to not contribute to this:
dbba42d8a ("dm: eliminate unused "bioset" process for each bio-based DM device")

(but yeah, all these extra 'bioset' threads aren't ideal)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-rc: 28 bioset threads on small notebook
  2015-12-11 14:08 ` Mike Snitzer
@ 2015-12-11 17:14   ` Pavel Machek
  2016-02-20 17:40   ` 4.4-final: " Pavel Machek
  1 sibling, 0 replies; 33+ messages in thread
From: Pavel Machek @ 2015-12-11 17:14 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: kernel list, axboe, hch

Hi!

> > Hi!
> > 
> > I know it is normal to spawn 8 threads for every single function,
> ...
> > but 28 threads?
> > 
> > root       974  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
> ...
> 
> How many physical block devices do you have?
> 
> DM is doing its part to not contribute to this:
> dbba42d8a ("dm: eliminate unused "bioset" process for each bio-based DM device")
> 
> (but yeah, all these extra 'bioset' threads aren't ideal)

Thinkpad X60. One sda, and one SD card slot. I guess I did use the SD
card from boot, and did a few suspend/resume cycles...
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* 4.4-final: 28 bioset threads on small notebook
  2015-12-11 14:08 ` Mike Snitzer
  2015-12-11 17:14   ` Pavel Machek
@ 2016-02-20 17:40   ` Pavel Machek
  2016-02-20 18:42     ` Pavel Machek
  1 sibling, 1 reply; 33+ messages in thread
From: Pavel Machek @ 2016-02-20 17:40 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: kernel list, axboe, hch


On Fri 2015-12-11 09:08:41, Mike Snitzer wrote:
> On Fri, Dec 11 2015 at  5:49am -0500,
> Pavel Machek <pavel@ucw.cz> wrote:
> 
> > Hi!
> > 
> > I know it is normal to spawn 8 threads for every single function,
> ...
> > but 28 threads?
> > 
> > root       974  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
> ...
> 
> How many physical block devices do you have?
> 
> DM is doing its part to not contribute to this:
> dbba42d8a ("dm: eliminate unused "bioset" process for each bio-based DM device")
> 
> (but yeah, all these extra 'bioset' threads aren't ideal)

Still there in 4.4-final.

Block devices:
259 blkext
  7 loop
  8 sd
  9 md
 11 sr
 43 nbd
 65 sd
 66 sd
 67 sd
 68 sd
 69 sd
 70 sd
 71 sd
128 sd
129 sd
130 sd
131 sd
132 sd
133 sd
134 sd
135 sd
179 mmc
251 device-mapper
252 bcache
253 pktcdvd
254 mdp

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-20 17:40   ` 4.4-final: " Pavel Machek
@ 2016-02-20 18:42     ` Pavel Machek
  2016-02-20 19:51       ` Mike Snitzer
  0 siblings, 1 reply; 33+ messages in thread
From: Pavel Machek @ 2016-02-20 18:42 UTC (permalink / raw)
  To: Mike Snitzer; +Cc: kernel list, axboe, hch

On Sat 2016-02-20 18:40:35, Pavel Machek wrote:
> 
> On Fri 2015-12-11 09:08:41, Mike Snitzer wrote:
> > On Fri, Dec 11 2015 at  5:49am -0500,
> > Pavel Machek <pavel@ucw.cz> wrote:
> > 
> > > Hi!
> > > 
> > > I know it is normal to spawn 8 threads for every single function,
> > ...
> > > but 28 threads?
> > > 
> > > root       974  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
> > ...
> > 
> > How many physical block devices do you have?
> > 
> > DM is doing its part to not contribute to this:
> > dbba42d8a ("dm: eliminate unused "bioset" process for each bio-based DM device")
> > 
> > (but yeah, all these extra 'bioset' threads aren't ideal)
> 
> Still there in 4.4-final.

...and still there in 4.5-rc4 :-(.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-20 18:42     ` Pavel Machek
@ 2016-02-20 19:51       ` Mike Snitzer
  2016-02-20 20:04         ` Pavel Machek
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Snitzer @ 2016-02-20 19:51 UTC (permalink / raw)
  To: Pavel Machek; +Cc: kernel list, axboe, hch

On Sat, Feb 20 2016 at  1:42pm -0500,
Pavel Machek <pavel@ucw.cz> wrote:

> On Sat 2016-02-20 18:40:35, Pavel Machek wrote:
> > 
> > On Fri 2015-12-11 09:08:41, Mike Snitzer wrote:
> > > On Fri, Dec 11 2015 at  5:49am -0500,
> > > Pavel Machek <pavel@ucw.cz> wrote:
> > > 
> > > > Hi!
> > > > 
> > > > I know it is normal to spawn 8 threads for every single function,
> > > ...
> > > > but 28 threads?
> > > > 
> > > > root       974  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
> > > ...
> > > 
> > > How many physical block devices do you have?
> > > 
> > > DM is doing its part to not contribute to this:
> > > dbba42d8a ("dm: eliminate unused "bioset" process for each bio-based DM device")
> > > 
> > > (but yeah, all these extra 'bioset' threads aren't ideal)
> > 
> > Still there in 4.4-final.
> 
> ...and still there in 4.5-rc4 :-(.
> 									Pavel

You're directing this concern to the wrong person.

I already told you DM is _not_ contributing any extra "bioset" threads
(ever since commit dbba42d8a).

But in general, these "bioset" threads are a side-effect of the
late-bio-splitting support.  So is your position on it: "I don't like
that feature if it comes at the expense of adding resources I can _see_
for something I (naively?) view as useless"?

Just seems... naive... but you could be trying to say something else
entirely.

Anyway, if you don't like something: understand why it is there and then
try to fix it to your liking (without compromising why it was there to
begin with).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-20 19:51       ` Mike Snitzer
@ 2016-02-20 20:04         ` Pavel Machek
  2016-02-20 20:38           ` Mike Snitzer
  2017-02-06 12:53           ` v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone Pavel Machek
  0 siblings, 2 replies; 33+ messages in thread
From: Pavel Machek @ 2016-02-20 20:04 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: kernel list, axboe, hch, kent.overstreet, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

Hi!

> > > > > I know it is normal to spawn 8 threads for every single function,
> > > > ...
> > > > > but 28 threads?
> > > > > 
> > > > > root       974  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
> > > > ...
> > > > 
> > > > How many physical block devices do you have?
> > > > 
> > > > DM is doing its part to not contribute to this:
> > > > dbba42d8a ("dm: eliminate unused "bioset" process for each bio-based DM device")
> > > > 
> > > > (but yeah, all these extra 'bioset' threads aren't ideal)
> > > 
> > > Still there in 4.4-final.
> > 
> > ...and still there in 4.5-rc4 :-(.
> 
> You're directing this concern to the wrong person.
> 
> I already told you DM is _not_ contributing any extra "bioset" threads
> (ever since commit dbba42d8a).

Well, sorry about that. Note that l-k is on the cc list, so hopefully
the right person sees it too.

Ok, let me check... it seems that 
54efd50bfd873e2dbf784e0b21a8027ba4299a3e is responsible, thus Kent
Overstreet <kent.overstreet@gmail.com> is to blame.

Um, and you acked the patch, so you are partly responsible.

> But in general, these "bioset" threads are a side-effect of the
> late-bio-splitting support.  So is your position on it: "I don't like
> that feature if it comes at the expense of adding resources I can _see_
> for something I (naively?) view as useless"?

> Just seems... naive... but you could be trying to say something else
> entirely.

> Anyway, if you don't like something: understand why it is there and then
> try to fix it to your liking (without compromising why it was there to
> begin with).

Well, 28 kernel threads on a notebook is a bug, plain and simple. Do
you argue it is not?

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-20 20:04         ` Pavel Machek
@ 2016-02-20 20:38           ` Mike Snitzer
  2016-02-20 20:55             ` Pavel Machek
  2017-02-06 12:53           ` v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone Pavel Machek
  1 sibling, 1 reply; 33+ messages in thread
From: Mike Snitzer @ 2016-02-20 20:38 UTC (permalink / raw)
  To: Pavel Machek
  Cc: kernel list, axboe, hch, kent.overstreet, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

On Sat, Feb 20 2016 at  3:04pm -0500,
Pavel Machek <pavel@ucw.cz> wrote:

> Hi!
> 
> > > > > > I know it is normal to spawn 8 threads for every single function,
> > > > > ...
> > > > > > but 28 threads?
> > > > > > 
> > > > > > root       974  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
> > > > > ...
> > > > > 
> > > > > How many physical block devices do you have?
> > > > > 
> > > > > DM is doing its part to not contribute to this:
> > > > > dbba42d8a ("dm: eliminate unused "bioset" process for each bio-based DM device")
> > > > > 
> > > > > (but yeah, all these extra 'bioset' threads aren't ideal)
> > > > 
> > > > Still there in 4.4-final.
> > > 
> > > ...and still there in 4.5-rc4 :-(.
> > 
> > You're directing this concern to the wrong person.
> > 
> > I already told you DM is _not_ contributing any extra "bioset" threads
> > (ever since commit dbba42d8a).
> 
> Well, sorry about that. Note that l-k is on the cc list, so hopefully
> the right person sees it too.
> 
> Ok, let me check... it seems that 
> 54efd50bfd873e2dbf784e0b21a8027ba4299a3e is responsible, thus Kent
> Overstreet <kent.overstreet@gmail.com> is to blame.
> 
> Um, and you acked the patch, so you are partly responsible.

You still haven't shown you even understand the patch so don't try to
blame me for one aspect you don't like.
 
> > But in general, these "bioset" threads are a side-effect of the
> > late-bio-splitting support.  So is your position on it: "I don't like
> > that feature if it comes at the expense of adding resources I can _see_
> > for something I (naively?) view as useless"?
> 
> > Just seems... naive... but you could be trying to say something else
> > entirely.
> 
> > Anyway, if you don't like something: understand why it is there and then
> > try to fix it to your liking (without compromising why it was there to
> > begin with).
> 
> Well, 28 kernel threads on a notebook is a bug, plain and simple. Do
> you argue it is not?

Just implies you have 28 request_queues, right?  You clearly have something
else going on with your notebook compared to the average notebook user.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-20 20:38           ` Mike Snitzer
@ 2016-02-20 20:55             ` Pavel Machek
  2016-02-21  4:15               ` Kent Overstreet
  0 siblings, 1 reply; 33+ messages in thread
From: Pavel Machek @ 2016-02-20 20:55 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: kernel list, axboe, hch, kent.overstreet, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

Hi!

> > > You're directing this concern to the wrong person.
> > > 
> > > I already told you DM is _not_ contributing any extra "bioset" threads
> > > (ever since commit dbba42d8a).
> > 
> > Well, sorry about that. Note that l-k is on the cc list, so hopefully
> > the right person sees it too.
> > 
> > Ok, let me check... it seems that 
> > 54efd50bfd873e2dbf784e0b21a8027ba4299a3e is responsible, thus Kent
> > Overstreet <kent.overstreet@gmail.com> is to blame.
> > 
> > Um, and you acked the patch, so you are partly responsible.
> 
> You still haven't shown you even understand the patch so don't try to
> blame me for one aspect you don't like.

Well, I don't have to understand the patch to argue it's wrong.

> > > But in general, these "bioset" threads are a side-effect of the
> > > late-bio-splitting support.  So is your position on it: "I don't like
> > > that feature if it comes at the expense of adding resources I can _see_
> > > for something I (naively?) view as useless"?
> > 
> > > Just seems... naive... but you could be trying to say something else
> > > entirely.
> > 
> > > Anyway, if you don't like something: understand why it is there and then
> > > try to fix it to your liking (without compromising why it was there to
> > > begin with).
> > 
> > Well, 28 kernel threads on a notebook is a bug, plain and simple. Do
> > you argue it is not?
> 
> Just implies you have 28 request_queues right?  You clearly have
> something else going on on your notebook than the average notebook
> user.

I'm not using the modules, but otherwise I'm not doing anything
special. How many request_queues should I expect? How many do you have
on your notebook?

Thanks,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-20 20:55             ` Pavel Machek
@ 2016-02-21  4:15               ` Kent Overstreet
  2016-02-21  6:43                 ` Ming Lin-SSI
  0 siblings, 1 reply; 33+ messages in thread
From: Kent Overstreet @ 2016-02-21  4:15 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Mike Snitzer, kernel list, axboe, hch, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

On Sat, Feb 20, 2016 at 09:55:19PM +0100, Pavel Machek wrote:
> Hi!
> 
> > > > You're directing this concern to the wrong person.
> > > > 
> > > > I already told you DM is _not_ contributing any extra "bioset" threads
> > > > (ever since commit dbba42d8a).
> > > 
> > > Well, sorry about that. Note that l-k is on the cc list, so hopefully
> > > the right person sees it too.
> > > 
> > > Ok, let me check... it seems that 
> > > 54efd50bfd873e2dbf784e0b21a8027ba4299a3e is responsible, thus Kent
> > > Overstreet <kent.overstreet@gmail.com> is to blame.
> > > 
> > > Um, and you acked the patch, so you are partly responsible.
> > 
> > You still haven't shown you even understand the patch so don't try to
> > blame me for one aspect you don't like.
> 
> Well, I don't have to understand the patch to argue its wrong.
> 
> > > > But in general, these "bioset" threads are a side-effect of the
> > > > late-bio-splitting support.  So is your position on it: "I don't like
> > > > that feature if it comes at the expense of adding resources I can _see_
> > > > for something I (naively?) view as useless"?
> > > 
> > > > Just seems... naive... but you could be trying to say something else
> > > > entirely.
> > > 
> > > > Anyway, if you don't like something: understand why it is there and then
> > > > try to fix it to your liking (without compromising why it was there to
> > > > begin with).
> > > 
> > > Well, 28 kernel threads on a notebook is a bug, plain and simple. Do
> > > you argue it is not?
> > 
> > Just implies you have 28 request_queues right?  You clearly have
> > something else going on on your notebook than the average notebook
> > user.
> 
> I'm not using the modules, but otherwise I'm not doing anything
> special. How many request_queues should I expect? How many do you have
> on your notebook?

It's one rescuer thread per bio_set, not one per request queue, so 28 is more
than I'd expect, but there are lots of random bio_sets so it's not entirely
unexpected.

It'd be better to have the rescuers be per request_queue, but someone is going
to have to write the code.
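
For reference, the reason it's one thread per bio_set is that bioset creation
itself sets up a rescuer workqueue, roughly like this (paraphrased from the
4.4-era block/bio.c, so treat it as a sketch rather than the exact code):

    spin_lock_init(&bs->rescue_lock);
    bio_list_init(&bs->rescue_list);
    INIT_WORK(&bs->rescue_work, bio_alloc_rescue);

    /* WQ_MEM_RECLAIM gives the workqueue a dedicated rescuer kthread,
     * which is what shows up in ps as "[bioset]" */
    bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
    if (!bs->rescue_workqueue)
        goto bad;

So the visible thread count tracks the number of bio_sets created, not the
number of request queues.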

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: 4.4-final: 28 bioset threads on small notebook
  2016-02-21  4:15               ` Kent Overstreet
@ 2016-02-21  6:43                 ` Ming Lin-SSI
  2016-02-21  9:40                   ` Ming Lei
  0 siblings, 1 reply; 33+ messages in thread
From: Ming Lin-SSI @ 2016-02-21  6:43 UTC (permalink / raw)
  To: Kent Overstreet, Pavel Machek
  Cc: Mike Snitzer, kernel list, axboe, hch, neilb, martin.petersen,
	dpark, dm-devel, ming.lei, agk, jkosina, geoff, jim, pjk1939,
	minchan, ngupta, oleg.drokin, andreas.dilger

>-----Original Message-----
>From: Kent Overstreet [mailto:kent.overstreet@gmail.com]
>
>On Sat, Feb 20, 2016 at 09:55:19PM +0100, Pavel Machek wrote:
>> Hi!
>>
>> > > > You're directing this concern to the wrong person.
>> > > >
>> > > > I already told you DM is _not_ contributing any extra "bioset" threads
>> > > > (ever since commit dbba42d8a).
>> > >
>> > > Well, sorry about that. Note that l-k is on the cc list, so hopefully
>> > > the right person sees it too.
>> > >
>> > > Ok, let me check... it seems that
>> > > 54efd50bfd873e2dbf784e0b21a8027ba4299a3e is responsible, thus Kent
>> > > Overstreet <kent.overstreet@gmail.com> is to blame.
>> > >
>> > > Um, and you acked the patch, so you are partly responsible.
>> >
>> > You still haven't shown you even understand the patch so don't try to
>> > blame me for one aspect you don't like.
>>
>> Well, I don't have to understand the patch to argue its wrong.
>>
>> > > > But in general, these "bioset" threads are a side-effect of the
>> > > > late-bio-splitting support.  So is your position on it: "I don't like
>> > > > that feature if it comes at the expense of adding resources I can _see_
>> > > > for something I (naively?) view as useless"?
>> > >
>> > > > Just seems... naive... but you could be trying to say something else
>> > > > entirely.
>> > >
>> > > > Anyway, if you don't like something: understand why it is there and
>then
>> > > > try to fix it to your liking (without compromising why it was there to
>> > > > begin with).
>> > >
>> > > Well, 28 kernel threads on a notebook is a bug, plain and simple. Do
>> > > you argue it is not?
>> >
>> > Just implies you have 28 request_queues right?  You clearly have
>> > something else going on on your notebook than the average notebook
>> > user.
>>
>> I'm not using the modules, but otherwise I'm not doing anything
>> special. How many request_queues should I expect? How many do you
>have
>> on your notebook?
>
>It's one rescuer thread per bio_set, not one per request queue, so 28 is more
>than I'd expect but there's lots of random bio_sets so it's not entirely
>unexpected.
>
>It'd be better to have the rescuers be per request_queue, just someone is
>going
>to have to write the code.

I booted a VM and it also has 28 bioset threads.

That's because I have 27 block devices.

root@wheezy:~# ls /sys/block/
loop0  loop2  loop4  loop6  ram0  ram10  ram12  ram14  ram2  ram4  ram6  ram8  sr0  vdb
loop1  loop3  loop5  loop7  ram1  ram11  ram13  ram15  ram3  ram5  ram7  ram9  vda

And the additional one comes from init_bio

[    0.329627] Call Trace:
[    0.329970]  [<ffffffff813b132c>] dump_stack+0x63/0x87
[    0.330531]  [<ffffffff81377e7e>] __bioset_create+0x29e/0x2b0
[    0.331127]  [<ffffffff81d97896>] ? ca_keys_setup+0xa6/0xa6
[    0.331735]  [<ffffffff81d97937>] init_bio+0xa1/0xd1
[    0.332284]  [<ffffffff8100213d>] do_one_initcall+0xcd/0x1f0
[    0.332883]  [<ffffffff810972b6>] ? parse_args+0x296/0x480
[    0.333460]  [<ffffffff81d56297>] kernel_init_freeable+0x16f/0x1fa
[    0.334131]  [<ffffffff81d55999>] ? initcall_blacklist+0xba/0xba
[    0.334747]  [<ffffffff8177d970>] ? rest_init+0x80/0x80
[    0.335301]  [<ffffffff8177d97e>] kernel_init+0xe/0xf0
[    0.335842]  [<ffffffff81789dcf>] ret_from_fork+0x3f/0x70
[    0.336371]  [<ffffffff8177d970>] ? rest_init+0x80/0x80

So it's almost already "per request_queue"
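
For completeness: the 28th bio_set here is the global fs_bio_set, which the
init_bio() call trace above is creating. From memory it is set up with
something like the following (the exact error handling may differ):

    fs_bio_set = bioset_create(BIO_POOL_SIZE, 0);
    if (!fs_bio_set)
        panic("bio: can't allocate bios\n");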

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-21  6:43                 ` Ming Lin-SSI
@ 2016-02-21  9:40                   ` Ming Lei
  2016-02-22 22:58                     ` Kent Overstreet
  0 siblings, 1 reply; 33+ messages in thread
From: Ming Lei @ 2016-02-21  9:40 UTC (permalink / raw)
  To: Ming Lin-SSI
  Cc: Kent Overstreet, Pavel Machek, Mike Snitzer, kernel list, axboe,
	hch, neilb, martin.petersen, dpark, dm-devel, agk, jkosina,
	geoff, jim, pjk1939, minchan, ngupta, oleg.drokin,
	andreas.dilger

On Sun, Feb 21, 2016 at 2:43 PM, Ming Lin-SSI <ming.l@ssi.samsung.com> wrote:
>>-----Original Message-----
>
> So it's almost already "per request_queue"

Yes, that is because of the following line:

q->bio_split = bioset_create(BIO_POOL_SIZE, 0);

in blk_alloc_queue_node().

It looks like this bio_set doesn't need to be per-request_queue: right now it
is only used for fast-cloning bios for splitting, and one global split bio_set
should be enough.


thanks,
Ming

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-21  9:40                   ` Ming Lei
@ 2016-02-22 22:58                     ` Kent Overstreet
  2016-02-23  2:55                       ` Ming Lei
  2016-02-23 20:45                       ` Pavel Machek
  0 siblings, 2 replies; 33+ messages in thread
From: Kent Overstreet @ 2016-02-22 22:58 UTC (permalink / raw)
  To: Ming Lei
  Cc: Ming Lin-SSI, Pavel Machek, Mike Snitzer, kernel list, axboe,
	hch, neilb, martin.petersen, dpark, dm-devel, agk, jkosina,
	geoff, jim, pjk1939, minchan, ngupta, oleg.drokin,
	andreas.dilger

On Sun, Feb 21, 2016 at 05:40:59PM +0800, Ming Lei wrote:
> On Sun, Feb 21, 2016 at 2:43 PM, Ming Lin-SSI <ming.l@ssi.samsung.com> wrote:
> >>-----Original Message-----
> >
> > So it's almost already "per request_queue"
> 
> Yes, that is because of the following line:
> 
> q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
> 
> in blk_alloc_queue_node().
> 
> Looks like this bio_set doesn't need to be per-request_queue, and
> now it is only used for fast-cloning bio for splitting, and one global
> split bio_set should be enough.

It does have to be per request queue for stacking block devices (which includes
loopback).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-22 22:58                     ` Kent Overstreet
@ 2016-02-23  2:55                       ` Ming Lei
  2016-02-23 14:54                         ` Mike Snitzer
  2016-02-23 20:45                       ` Pavel Machek
  1 sibling, 1 reply; 33+ messages in thread
From: Ming Lei @ 2016-02-23  2:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Ming Lin-SSI, Pavel Machek, Mike Snitzer, kernel list, axboe,
	hch, neilb, martin.petersen, dpark, dm-devel, agk, jkosina,
	geoff, jim, pjk1939, minchan, ngupta, oleg.drokin,
	andreas.dilger

On Tue, Feb 23, 2016 at 6:58 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Sun, Feb 21, 2016 at 05:40:59PM +0800, Ming Lei wrote:
>> On Sun, Feb 21, 2016 at 2:43 PM, Ming Lin-SSI <ming.l@ssi.samsung.com> wrote:
>> >>-----Original Message-----
>> >
>> > So it's almost already "per request_queue"
>>
>> Yes, that is because of the following line:
>>
>> q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
>>
>> in blk_alloc_queue_node().
>>
>> Looks like this bio_set doesn't need to be per-request_queue, and
>> now it is only used for fast-cloning bio for splitting, and one global
>> split bio_set should be enough.
>
> It does have to be per request queue for stacking block devices (which includes
> loopback).

In commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers"), the deadlock in this situation has already been avoided.
Or are there other issues with a global bio_set? I'd appreciate it if you
could explain them a bit if there are.

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-23  2:55                       ` Ming Lei
@ 2016-02-23 14:54                         ` Mike Snitzer
  2016-02-24  2:48                           ` Ming Lei
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Snitzer @ 2016-02-23 14:54 UTC (permalink / raw)
  To: Ming Lei
  Cc: Kent Overstreet, oleg.drokin, Ming Lin-SSI, andreas.dilger,
	martin.petersen, minchan, jkosina, kernel list, jim, pjk1939,
	axboe, geoff, dm-devel, dpark, Pavel Machek, ngupta, hch, agk

On Mon, Feb 22 2016 at  9:55pm -0500,
Ming Lei <ming.lei@canonical.com> wrote:

> On Tue, Feb 23, 2016 at 6:58 AM, Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> > On Sun, Feb 21, 2016 at 05:40:59PM +0800, Ming Lei wrote:
> >> On Sun, Feb 21, 2016 at 2:43 PM, Ming Lin-SSI <ming.l@ssi.samsung.com> wrote:
> >> >>-----Original Message-----
> >> >
> >> > So it's almost already "per request_queue"
> >>
> >> Yes, that is because of the following line:
> >>
> >> q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
> >>
> >> in blk_alloc_queue_node().
> >>
> >> Looks like this bio_set doesn't need to be per-request_queue, and
> >> now it is only used for fast-cloning bio for splitting, and one global
> >> split bio_set should be enough.
> >
> > It does have to be per request queue for stacking block devices (which includes
> > loopback).
> 
> In commit df2cb6daa4(block: Avoid deadlocks with bio allocation by
> stacking drivers), deadlock in this situation has been avoided already.
> Or are there other issues with global bio_set? I appreciate if you may
> explain it a bit if there are.

Even with commit df2cb6daa4 there is still a risk of deadlocks (even
without a low-memory condition), see:
https://patchwork.kernel.org/patch/7398411/

(you may recall you blocked this patch with concerns about performance,
context switches, plug merging being compromised, etc., to which I never
circled back to verify your concerns)

But it illustrates the type of problems that can occur when your rescue
infrastructure is shared across devices (in the context of df2cb6daa4,
current->bio_list contains bios from multiple devices). 

If a single splitting bio_set were shared across devices there would be
no guarantee of forward progress with complex stacked devices (one or
more devices could exhaust the reserve and starve out other devices in
the stack).  So keeping the bio_set per request_queue isn't prone to
failure like a shared bio_set might be.

Mike

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-22 22:58                     ` Kent Overstreet
  2016-02-23  2:55                       ` Ming Lei
@ 2016-02-23 20:45                       ` Pavel Machek
  1 sibling, 0 replies; 33+ messages in thread
From: Pavel Machek @ 2016-02-23 20:45 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Ming Lei, Ming Lin-SSI, Mike Snitzer, kernel list, axboe, hch,
	neilb, martin.petersen, dpark, dm-devel, agk, jkosina, geoff,
	jim, pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

On Mon 2016-02-22 13:58:18, Kent Overstreet wrote:
> On Sun, Feb 21, 2016 at 05:40:59PM +0800, Ming Lei wrote:
> > On Sun, Feb 21, 2016 at 2:43 PM, Ming Lin-SSI <ming.l@ssi.samsung.com> wrote:
> > >>-----Original Message-----
> > >
> > > So it's almost already "per request_queue"
> > 
> > Yes, that is because of the following line:
> > 
> > q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
> > 
> > in blk_alloc_queue_node().
> > 
> > Looks like this bio_set doesn't need to be per-request_queue, and
> > now it is only used for fast-cloning bio for splitting, and one global
> > split bio_set should be enough.
> 
> It does have to be per request queue for stacking block devices (which includes
> loopback).

Could we avoid allocating request queues for devices that are not even
opened? I have these in my system:

loop0  loop2  loop4  loop6  md0   nbd1	 nbd11	nbd13  nbd15  nbd3 nbd5  nbd7    nbd9  sda1  sda3
loop1  loop3  loop5  loop7  nbd0  nbd10  nbd12	nbd14  nbd2   nbd4 nbd6  nbd8    sda   sda2  sda4

...but nbd is never used, loop1+ is never used, and loop0 is only used
once in a blue moon. Each process takes 8K+...

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-23 14:54                         ` Mike Snitzer
@ 2016-02-24  2:48                           ` Ming Lei
  2016-02-24  3:23                             ` Kent Overstreet
  0 siblings, 1 reply; 33+ messages in thread
From: Ming Lei @ 2016-02-24  2:48 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Kent Overstreet, oleg.drokin, Ming Lin-SSI, andreas.dilger,
	martin.petersen, minchan, jkosina, kernel list, jim, pjk1939,
	axboe, geoff, dm-devel, dpark, Pavel Machek, ngupta, hch, agk

On Tue, Feb 23, 2016 at 10:54 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> On Mon, Feb 22 2016 at  9:55pm -0500,
> Ming Lei <ming.lei@canonical.com> wrote:
>
>> On Tue, Feb 23, 2016 at 6:58 AM, Kent Overstreet
>> <kent.overstreet@gmail.com> wrote:
>> > On Sun, Feb 21, 2016 at 05:40:59PM +0800, Ming Lei wrote:
>> >> On Sun, Feb 21, 2016 at 2:43 PM, Ming Lin-SSI <ming.l@ssi.samsung.com> wrote:
>> >> >>-----Original Message-----
>> >> >
>> >> > So it's almost already "per request_queue"
>> >>
>> >> Yes, that is because of the following line:
>> >>
>> >> q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
>> >>
>> >> in blk_alloc_queue_node().
>> >>
>> >> Looks like this bio_set doesn't need to be per-request_queue, and
>> >> now it is only used for fast-cloning bio for splitting, and one global
>> >> split bio_set should be enough.
>> >
>> > It does have to be per request queue for stacking block devices (which includes
>> > loopback).
>>
>> In commit df2cb6daa4(block: Avoid deadlocks with bio allocation by
>> stacking drivers), deadlock in this situation has been avoided already.
>> Or are there other issues with global bio_set? I appreciate if you may
>> explain it a bit if there are.
>
> Even with commit df2cb6daa4 there is still risk of deadlocks (even
> without low memory condition), see:
> https://patchwork.kernel.org/patch/7398411/

That is definitely another problem, one which isn't related to low memory,
and I guess Kent means there might be a deadlock risk in the case of a
shared bio_set.

>
> (you may recall you blocked this patch with concerns about performance,
> context switches, plug merging being compromised, etc.. to which I never
> circled back to verify your concerns)

I still remember that problem:

1) Process A
     - two bios (a, b) are split in dm's make_request function
     - bio(a) is submitted via generic_make_request(), so it is staged
       in current->bio_list
     - time t1
     - before bio(b) is submitted, down_write(&s->lock) is called and
       never returns

2) Process B:
     - just during time t1, takes down_write(&s->lock) first and then
       waits for completion of bio(a)

Then Process A waits for the lock, which B acquired first, and the two
bios (a, b) can't reach the driver/device at all.
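
To spell out the ordering (purely illustrative pseudo-C, not real code;
bio_a/bio_b and bio_a_done are made-up stand-ins, and s->lock is the lock
from the scenario above):

    /* Process A, inside the make_request function, under generic_make_request() */
    generic_make_request(bio_a);    /* bio(a) is only staged on current->bio_list */
    down_write(&s->lock);           /* t1: blocks, because B already holds it */
    generic_make_request(bio_b);    /* never reached */

    /* Process B, which took s->lock first */
    wait_for_completion(&bio_a_done);   /* bio(a) never leaves A's bio_list */
    up_write(&s->lock);                 /* never reached: deadlock */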

It looks like current->bio_list is fragile with respect to locks taken from a
make_request function, and moving the lock into workqueue context should help.

And I am happy to continue discussing this issue further.

>
> But it illustrates the type of problems that can occur when your rescue
> infrastructure is shared across devices (in the context of df2cb6daa4,
> current->bio_list contains bios from multiple devices).
>
> If a single splitting bio_set were shared across devices there would be
> no guarantee of forward progress with complex stacked devices (one or
> more devices could exhaust the reserve and starve out other devices in
> the stack).  So keeping the bio_set per request_queue isn't prone to
> failure like a shared bio_set might be.

Setting the dm lock problem aside: from the log of Kent's commit (df2cb6daa4)
and the patch itself, it looks like forward progress can be guaranteed for
stacked devices sharing the same bio_set, but it would be better to get Kent's
clarification.

If forward progress can be guaranteed, a percpu mempool might avoid being
exhausted so easily, because it is reasonable to assume that one CPU can only
provide a certain amount of bandwidth wrt. block transfer.

Thanks
Ming

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: 4.4-final: 28 bioset threads on small notebook
  2016-02-24  2:48                           ` Ming Lei
@ 2016-02-24  3:23                             ` Kent Overstreet
  0 siblings, 0 replies; 33+ messages in thread
From: Kent Overstreet @ 2016-02-24  3:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Mike Snitzer, oleg.drokin, Ming Lin-SSI, andreas.dilger,
	martin.petersen, minchan, jkosina, kernel list, jim, pjk1939,
	axboe, geoff, dm-devel, dpark, Pavel Machek, ngupta, hch, agk

On Wed, Feb 24, 2016 at 10:48:10AM +0800, Ming Lei wrote:
> On Tue, Feb 23, 2016 at 10:54 PM, Mike Snitzer <snitzer@redhat.com> wrote:
> > On Mon, Feb 22 2016 at  9:55pm -0500,
> > Ming Lei <ming.lei@canonical.com> wrote:
> >
> >> On Tue, Feb 23, 2016 at 6:58 AM, Kent Overstreet
> >> <kent.overstreet@gmail.com> wrote:
> >> > On Sun, Feb 21, 2016 at 05:40:59PM +0800, Ming Lei wrote:
> >> >> On Sun, Feb 21, 2016 at 2:43 PM, Ming Lin-SSI <ming.l@ssi.samsung.com> wrote:
> >> >> >>-----Original Message-----
> >> >> >
> >> >> > So it's almost already "per request_queue"
> >> >>
> >> >> Yes, that is because of the following line:
> >> >>
> >> >> q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
> >> >>
> >> >> in blk_alloc_queue_node().
> >> >>
> >> >> Looks like this bio_set doesn't need to be per-request_queue, and
> >> >> now it is only used for fast-cloning bio for splitting, and one global
> >> >> split bio_set should be enough.
> >> >
> >> > It does have to be per request queue for stacking block devices (which includes
> >> > loopback).
> >>
> >> In commit df2cb6daa4(block: Avoid deadlocks with bio allocation by
> >> stacking drivers), deadlock in this situation has been avoided already.
> >> Or are there other issues with global bio_set? I appreciate if you may
> >> explain it a bit if there are.
> >
> > Even with commit df2cb6daa4 there is still risk of deadlocks (even
> > without low memory condition), see:
> > https://patchwork.kernel.org/patch/7398411/
> 
> That is definitely another problem which isn't related with low memory,
> and I guess Kent means there might be deadlock risk in case of shared
> bio_set.
> 
> >
> > (you may recall you blocked this patch with concerns about performance,
> > context switches, plug merging being compromised, etc.. to which I never
> > circled back to verify your concerns)
> 
> I still remember that problem:
> 
> 1) Process A
>      - two bio(a, b) are splitted in dm's make_request funtion
>      - bio(a) is submitted via generic_make_request(), so it is staged
>        in current->bio_list
>      - time t1
>      - before bio(b) is submitted, down_write(&s->lock) is run and
>       never return
> 
> 2) Process B:
>      - just during time t1, wait completion of bio(a) by down_write(&s->lock)
> 
> Then Process A waits the lock which is acquired by B first, and the
> two bio(a, b)
> can't reach to driver/device at all.
> 
> Looks that current->bio_list is fragile to locks from make_request function,
> and moving the lock into workqueue context should be helpful.
> 
> And I am happy to continue to discuss this issue further.
> 
> >
> > But it illustrates the type of problems that can occur when your rescue
> > infrastructure is shared across devices (in the context of df2cb6daa4,
> > current->bio_list contains bios from multiple devices).
> >
> > If a single splitting bio_set were shared across devices there would be
> > no guarantee of forward progress with complex stacked devices (one or
> > more devices could exhaust the reserve and starve out other devices in
> > the stack).  So keeping the bio_set per request_queue isn't prone to
> > failure like a shared bio_set might be.
> 
> Not consider the dm lock problem, from Kent's commit(df2cb6daa4) log and
> the patch, looks forward progress can be guaranteed for stacked devices
> with same bio_set, but better to get Kent's clarification.
> 
> If forward progress can be guaranteed, percpu mempool might avoid
> easy exhausting, because it is reasonable to assume that one CPU can only
> provide a certain amount of bandwidth wrt. block transfer.

Generally speaking, with potential deadlocks like this I don't bother to work
out the specific scenario, it's enough to know that there's a shared resource
and multiple users that depend on each other... if you've got that, you'll have
a deadlock.

But, if you're curious: say we've got block devices a and b, when you submit to
a the bio will get passed down to b:

for the bioset itself: if a bio gets split when submitted to a, then needs to be
split again when it's submitted to b - you're allocating twice from the same
mempool, and the first allocation can't be freed until the original bio
completes. deadlock.
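
A concrete (hypothetical) shape of that with a shared split bio_set - the
names shared_split_bs, max_a_sectors and max_b_sectors are made up for this
sketch:

    /* in a's make_request_fn: first allocation from the shared pool */
    split = bio_split(bio, max_a_sectors, GFP_NOIO, shared_split_bs);

    /* later, in b's make_request_fn, for the same I/O: second allocation */
    split = bio_split(bio, max_b_sectors, GFP_NOIO, shared_split_bs);
    /* if the shared pool's reserve is exhausted, this can block forever:
     * the first split can't complete and return its element to the pool
     * until the I/O below it is issued, which is what we're blocking here */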

with the rescuer threads it's more subtle, but you just need a scenario where
the rescuer is required twice in a row. I'm not going to bother trying to work
out the details, but it's the same principle - you can end up in a situation
where you're blocked, and you need the rescuer thread to make forward progress
(or you'd deadlock - that's why it exists, right?) - well, what happens if that
happens twice in a row, and the second time you're running out of the rescuer
thread? oops.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2016-02-20 20:04         ` Pavel Machek
  2016-02-20 20:38           ` Mike Snitzer
@ 2017-02-06 12:53           ` Pavel Machek
  2017-02-07  1:47             ` Kent Overstreet
  1 sibling, 1 reply; 33+ messages in thread
From: Pavel Machek @ 2017-02-06 12:53 UTC (permalink / raw)
  To: Mike Snitzer, kent.overstreet
  Cc: kernel list, axboe, hch, kent.overstreet, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

[-- Attachment #1: Type: text/plain, Size: 1484 bytes --]

On Sat 2016-02-20 21:04:32, Pavel Machek wrote:
> Hi!
> 
> > > > > > I know it is normal to spawn 8 threads for every single function,
> > > > > ...
> > > > > > but 28 threads?
> > > > > > 
> > > > > > root       974  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
> > > > > ...
> > > > > 
> > > > > How many physical block devices do you have?
> > > > > 
> > > > > DM is doing its part to not contribute to this:
> > > > > dbba42d8a ("dm: eliminate unused "bioset" process for each bio-based DM device")
> > > > > 
> > > > > (but yeah, all these extra 'bioset' threads aren't ideal)
> > > > 
> > > > Still there in 4.4-final.
> > > 
> > > ...and still there in 4.5-rc4 :-(.
> > 
> > You're directing this concern to the wrong person.
> > 
> > I already told you DM is _not_ contributing any extra "bioset" threads
> > (ever since commit dbba42d8a).
> 
> Well, sorry about that. Note that l-k is on the cc list, so hopefully
> the right person sees it too.
> 
> Ok, let me check... it seems that 
> 54efd50bfd873e2dbf784e0b21a8027ba4299a3e is responsible, thus Kent
> Overstreet <kent.overstreet@gmail.com> is to blame.
> 
> Um, and you acked the patch, so you are partly responsible.

Still there on v4.9: 36 threads on a Nokia N900 cellphone.

So.. what needs to be done there?
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-06 12:53           ` v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone Pavel Machek
@ 2017-02-07  1:47             ` Kent Overstreet
  2017-02-07  2:49               ` Kent Overstreet
  0 siblings, 1 reply; 33+ messages in thread
From: Kent Overstreet @ 2017-02-07  1:47 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Mike Snitzer, kernel list, axboe, hch, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> On Sat 2016-02-20 21:04:32, Pavel Machek wrote:
> > Hi!
> > 
> > > > > > > I know it is normal to spawn 8 threads for every single function,
> > > > > > ...
> > > > > > > but 28 threads?
> > > > > > > 
> > > > > > > root       974  0.0  0.0      0     0 ?        S<   Dec08   0:00 [bioset]
> > > > > > ...
> > > > > > 
> > > > > > How many physical block devices do you have?
> > > > > > 
> > > > > > DM is doing its part to not contribute to this:
> > > > > > dbba42d8a ("dm: eliminate unused "bioset" process for each bio-based DM device")
> > > > > > 
> > > > > > (but yeah, all these extra 'bioset' threads aren't ideal)
> > > > > 
> > > > > Still there in 4.4-final.
> > > > 
> > > > ...and still there in 4.5-rc4 :-(.
> > > 
> > > You're directing this concern to the wrong person.
> > > 
> > > I already told you DM is _not_ contributing any extra "bioset" threads
> > > (ever since commit dbba42d8a).
> > 
> > Well, sorry about that. Note that l-k is on the cc list, so hopefully
> > the right person sees it too.
> > 
> > Ok, let me check... it seems that 
> > 54efd50bfd873e2dbf784e0b21a8027ba4299a3e is responsible, thus Kent
> > Overstreet <kent.overstreet@gmail.com> is to blame.
> > 
> > Um, and you acked the patch, so you are partly responsible.
> 
> Still there on v4.9, 36 threads on nokia n900 cellphone.
> 
> So.. what needs to be done there?

So background:

We need rescuer threads because:

Say you allocate a bio from a bioset, submit that bio, and then allocate again
from that same bioset: if you're running underneath generic_make_request(),
you'll deadlock. Real submission of the first bio you allocated is blocked until
you return from your make_request_fn(), but you're blocking that trying to
allocate - this is handled (in a hacky way!) with the punt_bios_to_rescuer code
when we go to allocate from a bioset but have to block.

We need more than a single global rescuer, because:

The rescuer thread is just resubmitting bios, so if in the course of submitting
bios, _their_ drivers allocate new bios from biosets and block - oops, we're
recursing.

However:

The rescuer threads don't inherently need to be per bioset - they really ought
to be per block device.

Additionally, triggering the "punt bios to rescuer" code only when we go to
allocate from a bioset and block is wrong: it's possible to create these sorts
of deadlocks by blocking on other things. The right thing to do would be to
trigger this "punt bios to rescuer" thing whenever we schedule, and there's
still bios on current->bio_list.

This is actually how Jens's new(er) plugging code works (which postdates my
punt-bios-to-rescuer hack). What needs to happen is that Jens's scheduler hook
for block plugging needs to be unified with both the current->bio_list thing
(which is really a block plug, just open coded, as it predates _all_ of this)
and the rescuer thread stuff.

The tricky part is going to be making the rescuer threads per block device
_correctly_ (without introducing new deadlocks)... reasoning out these deadlocks
always makes my head hurt, the biggest reason I made the rescuer threads per
bioset was that when I wrote the code I wasn't at all confident I could get
anything else right. Still uneasy about that :)

What needs rescuer threads?

- if we're allocating from a bioset, but we're not running under
  generic_make_request() (e.g. we're in filesystem code) - don't need a rescuer
  there, we're not blocking previously allocated bios from being submitted.

- if we're a block driver that doesn't allocate new bios, we don't need a
  rescuer thread.

  But note that any block driver that calls blk_queue_split() to handle
  arbitrary size bios is allocating bios (see the sketch below). If we
  converted e.g. the standard request queue code to process bios/requests
  incrementally, instead of requiring them to be split, this would go away
  (and that's something that should be done anyway; it would improve
  performance by getting rid of segment counting).
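
  A minimal sketch of that allocation path (4.x-era calling convention, from
  memory):

      /* in a driver's make_request_fn */
      blk_queue_split(q, &bio, q->bio_split);  /* may allocate a clone from
                                                  q->bio_split's mempool */

  so any driver that goes through blk_queue_split() is a bio allocator and
  currently inherits the rescuer requirement.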

So, we should only need one rescuer thread per block device - and if we get rid
of splitting to handle arbitrary size bios, most block devices won't need
rescuers.

The catch is that the correct rescuer to punt a bio to corresponds to the device
that _issued_ the bio, not the device the bio was submitted to, and that
information is currently lost by the time we block - that's the other reason I
made rescuers per bioset, since bios do track the bioset they were allocated
from.

But, I just got an idea for how to handle this that might be halfway sane, maybe
I'll try and come up with a patch...

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-07  1:47             ` Kent Overstreet
@ 2017-02-07  2:49               ` Kent Overstreet
  2017-02-07 17:13                 ` Mike Snitzer
                                   ` (2 more replies)
  0 siblings, 3 replies; 33+ messages in thread
From: Kent Overstreet @ 2017-02-07  2:49 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Mike Snitzer, kernel list, axboe, hch, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > 
> > So.. what needs to be done there?

> But, I just got an idea for how to handle this that might be halfway sane, maybe
> I'll try and come up with a patch...

Ok, here's such a patch, only lightly tested:

-- >8 --
Subject: [PATCH] block: Make rescuer threads per request_queue, not per bioset

Note: this patch is very lightly tested.

Also, trigger rescuing whenever with bios on current->bio_list, instead
of only when we block in bio_alloc_bioset(). This is more correct, and
should result in fewer rescuer threads.

XXX: The current->bio_list plugging needs to be unified with the
blk_plug mechanism.

TODO: If we change normal request_queue drivers to handle arbitrary size
bios by processing requests incrementally, instead of splitting bios,
then we can get rid of rescuer threads from those devices.
---
 block/bio.c            | 107 ++++---------------------------------------------
 block/blk-core.c       |  58 ++++++++++++++++++++++++---
 block/blk-sysfs.c      |   2 +
 include/linux/bio.h    |  16 ++++----
 include/linux/blkdev.h |  10 +++++
 include/linux/sched.h  |   2 +-
 kernel/sched/core.c    |   4 ++
 7 files changed, 83 insertions(+), 116 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index f3b5786202..9ad54a9b12 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -336,54 +336,6 @@ void bio_chain(struct bio *bio, struct bio *parent)
 }
 EXPORT_SYMBOL(bio_chain);
 
-static void bio_alloc_rescue(struct work_struct *work)
-{
-	struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
-	struct bio *bio;
-
-	while (1) {
-		spin_lock(&bs->rescue_lock);
-		bio = bio_list_pop(&bs->rescue_list);
-		spin_unlock(&bs->rescue_lock);
-
-		if (!bio)
-			break;
-
-		generic_make_request(bio);
-	}
-}
-
-static void punt_bios_to_rescuer(struct bio_set *bs)
-{
-	struct bio_list punt, nopunt;
-	struct bio *bio;
-
-	/*
-	 * In order to guarantee forward progress we must punt only bios that
-	 * were allocated from this bio_set; otherwise, if there was a bio on
-	 * there for a stacking driver higher up in the stack, processing it
-	 * could require allocating bios from this bio_set, and doing that from
-	 * our own rescuer would be bad.
-	 *
-	 * Since bio lists are singly linked, pop them all instead of trying to
-	 * remove from the middle of the list:
-	 */
-
-	bio_list_init(&punt);
-	bio_list_init(&nopunt);
-
-	while ((bio = bio_list_pop(current->bio_list)))
-		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
-	*current->bio_list = nopunt;
-
-	spin_lock(&bs->rescue_lock);
-	bio_list_merge(&bs->rescue_list, &punt);
-	spin_unlock(&bs->rescue_lock);
-
-	queue_work(bs->rescue_workqueue, &bs->rescue_work);
-}
-
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask:   the GFP_ mask given to the slab allocator
@@ -421,54 +373,27 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
-	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	struct bio_vec *bvl = NULL;
 	struct bio *bio;
 	void *p;
 
-	if (!bs) {
-		if (nr_iovecs > UIO_MAXIOV)
-			return NULL;
+	WARN(current->bio_list &&
+	     !current->bio_list->q->rescue_workqueue,
+	     "allocating bio beneath generic_make_request() without rescuer");
 
+	if (nr_iovecs > UIO_MAXIOV)
+		return NULL;
+
+	if (!bs) {
 		p = kmalloc(sizeof(struct bio) +
 			    nr_iovecs * sizeof(struct bio_vec),
 			    gfp_mask);
 		front_pad = 0;
 		inline_vecs = nr_iovecs;
 	} else {
-		/*
-		 * generic_make_request() converts recursion to iteration; this
-		 * means if we're running beneath it, any bios we allocate and
-		 * submit will not be submitted (and thus freed) until after we
-		 * return.
-		 *
-		 * This exposes us to a potential deadlock if we allocate
-		 * multiple bios from the same bio_set() while running
-		 * underneath generic_make_request(). If we were to allocate
-		 * multiple bios (say a stacking block driver that was splitting
-		 * bios), we would deadlock if we exhausted the mempool's
-		 * reserve.
-		 *
-		 * We solve this, and guarantee forward progress, with a rescuer
-		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
-		 * bios we would be blocking to the rescuer workqueue before
-		 * we retry with the original gfp_flags.
-		 */
-
-		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
-
 		p = mempool_alloc(&bs->bio_pool, gfp_mask);
-		if (!p && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			p = mempool_alloc(&bs->bio_pool, gfp_mask);
-		}
-
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
@@ -483,12 +408,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		unsigned long idx = 0;
 
 		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, &bs->bvec_pool);
-		if (!bvl && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, &bs->bvec_pool);
-		}
-
 		if (unlikely(!bvl))
 			goto err_free;
 
@@ -1938,10 +1857,6 @@ int biovec_init_pool(mempool_t *pool, int pool_entries)
 
 void bioset_exit(struct bio_set *bs)
 {
-	if (bs->rescue_workqueue)
-		destroy_workqueue(bs->rescue_workqueue);
-	bs->rescue_workqueue = NULL;
-
 	mempool_exit(&bs->bio_pool);
 	mempool_exit(&bs->bvec_pool);
 
@@ -1968,10 +1883,6 @@ static int __bioset_init(struct bio_set *bs,
 
 	bs->front_pad = front_pad;
 
-	spin_lock_init(&bs->rescue_lock);
-	bio_list_init(&bs->rescue_list);
-	INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
-
 	bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
 	if (!bs->bio_slab)
 		return -ENOMEM;
@@ -1983,10 +1894,6 @@ static int __bioset_init(struct bio_set *bs,
 	    biovec_init_pool(&bs->bvec_pool, pool_size))
 		goto bad;
 
-	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
-	if (!bs->rescue_workqueue)
-		goto bad;
-
 	return 0;
 bad:
 	bioset_exit(bs);
diff --git a/block/blk-core.c b/block/blk-core.c
index 7e3cfa9c88..f716164cb3 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -48,6 +48,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
 
 DEFINE_IDA(blk_queue_ida);
 
+static void bio_rescue_work(struct work_struct *);
+
 /*
  * For the allocated request tables
  */
@@ -759,11 +761,21 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 				PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto fail_bdi;
 
-	if (blkcg_init_queue(q))
+	spin_lock_init(&q->rescue_lock);
+	bio_list_init(&q->rescue_list);
+	INIT_WORK(&q->rescue_work, bio_rescue_work);
+
+	q->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
+	if (!q->rescue_workqueue)
 		goto fail_ref;
 
+	if (blkcg_init_queue(q))
+		goto fail_rescue;
+
 	return q;
 
+fail_rescue:
+	destroy_workqueue(q->rescue_workqueue);
 fail_ref:
 	percpu_ref_exit(&q->q_usage_counter);
 fail_bdi:
@@ -1994,7 +2006,7 @@ generic_make_request_checks(struct bio *bio)
  */
 blk_qc_t generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	struct bio_plug_list bio_list_on_stack;
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -2011,7 +2023,9 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * should be added at the tail
 	 */
 	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+		WARN(!current->bio_list->q->rescue_workqueue,
+		     "submitting bio beneath generic_make_request() without rescuer");
+		bio_list_add(&current->bio_list->bios, bio);
 		goto out;
 	}
 
@@ -2030,19 +2044,23 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
+	bio_list_init(&bio_list_on_stack.bios);
 	current->bio_list = &bio_list_on_stack;
+
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
+		current->bio_list->q = q;
+
 		if (likely(blk_queue_enter(q, false) == 0)) {
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
 
-			bio = bio_list_pop(current->bio_list);
+			bio = bio_list_pop(&current->bio_list->bios);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
+			struct bio *bio_next =
+				bio_list_pop(&current->bio_list->bios);
 
 			bio_io_error(bio);
 			bio = bio_next;
@@ -2055,6 +2073,34 @@ blk_qc_t generic_make_request(struct bio *bio)
 }
 EXPORT_SYMBOL(generic_make_request);
 
+static void bio_rescue_work(struct work_struct *work)
+{
+	struct request_queue *q =
+		container_of(work, struct request_queue, rescue_work);
+	struct bio *bio;
+
+	while (1) {
+		spin_lock(&q->rescue_lock);
+		bio = bio_list_pop(&q->rescue_list);
+		spin_unlock(&q->rescue_lock);
+
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+}
+
+void blk_punt_blocked_bios(struct bio_plug_list *list)
+{
+	spin_lock(&list->q->rescue_lock);
+	bio_list_merge(&list->q->rescue_list, &list->bios);
+	bio_list_init(&list->bios);
+	spin_unlock(&list->q->rescue_lock);
+
+	queue_work(list->q->rescue_workqueue, &list->q->rescue_work);
+}
+
 /**
  * submit_bio - submit a bio to the block device layer for I/O
  * @bio: The &struct bio which describes the I/O
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 7f27a18cc4..77529238d1 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -660,6 +660,8 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_trace_shutdown(q);
 
+	if (q->rescue_workqueue)
+		destroy_workqueue(q->rescue_workqueue);
 	if (q->bio_split)
 		bioset_free(q->bio_split);
 
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 1ffe8e37ae..87eeec7eda 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -658,6 +658,13 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
 	return bio;
 }
 
+struct bio_plug_list {
+	struct bio_list		bios;
+	struct request_queue	*q;
+};
+
+void blk_punt_blocked_bios(struct bio_plug_list *);
+
 /*
  * Increment chain count for the bio. Make sure the CHAIN flag update
  * is visible before the raised count.
@@ -687,15 +694,6 @@ struct bio_set {
 	mempool_t bio_integrity_pool;
 	mempool_t bvec_integrity_pool;
 #endif
-
-	/*
-	 * Deadlock avoidance for stacking block drivers: see comments in
-	 * bio_alloc_bioset() for details
-	 */
-	spinlock_t		rescue_lock;
-	struct bio_list		rescue_list;
-	struct work_struct	rescue_work;
-	struct workqueue_struct	*rescue_workqueue;
 };
 
 struct biovec_slab {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c47c358ba0..f64b886c65 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -476,6 +476,16 @@ struct request_queue {
 	struct bio_set		*bio_split;
 
 	bool			mq_sysfs_init_done;
+
+	/*
+	 * Deadlock avoidance, to deal with the plugging in
+	 * generic_make_request() that converts recursion to iteration to avoid
+	 * stack overflow:
+	 */
+	spinlock_t		rescue_lock;
+	struct bio_list		rescue_list;
+	struct work_struct	rescue_work;
+	struct workqueue_struct	*rescue_workqueue;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2865d10a28..59df7a1030 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1762,7 +1762,7 @@ struct task_struct {
 	void *journal_info;
 
 /* stacked block device info */
-	struct bio_list *bio_list;
+	struct bio_plug_list *bio_list;
 
 #ifdef CONFIG_BLOCK
 /* stack plugging */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bd39d698cb..23b6290ba1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3439,6 +3439,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
 {
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
+
+	if (tsk->bio_list && !bio_list_empty(&tsk->bio_list->bios))
+		blk_punt_blocked_bios(tsk->bio_list);
+
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-07  2:49               ` Kent Overstreet
@ 2017-02-07 17:13                 ` Mike Snitzer
  2017-02-07 20:39                 ` Pavel Machek
  2017-02-08  2:47                 ` Ming Lei
  2 siblings, 0 replies; 33+ messages in thread
From: Mike Snitzer @ 2017-02-07 17:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Pavel Machek, oleg.drokin, ming.l, andreas.dilger,
	martin.petersen, minchan, jkosina, ming.lei, kernel list, jim,
	pjk1939, axboe, geoff, dm-devel, dpark, ngupta, hch, agk

On Mon, Feb 06 2017 at  9:49pm -0500,
Kent Overstreet <kent.overstreet@gmail.com> wrote:

> On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > 
> > > So.. what needs to be done there?
> 
> > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > I'll try and come up with a patch...
> 
> Ok, here's such a patch, only lightly tested:
> 
> -- >8 --
> Subject: [PATCH] block: Make rescuer threads per request_queue, not per bioset
> 
> Note: this patch is very lightly tested.
> 
> Also, trigger rescuing whenever we block with bios on current->bio_list,
> instead of only when we block in bio_alloc_bioset(). This is more correct,
> and should result in fewer rescuer threads.
> 
> XXX: The current->bio_list plugging needs to be unified with the
> blk_plug mechanism.
> 
> TODO: If we change normal request_queue drivers to handle arbitrary size
> bios by processing requests incrementally, instead of splitting bios,
> then we can get rid of rescuer threads from those devices.

Hi Kent,

I really appreciate you working on this further.  Thanks.

As I think you're probably already aware, a long standing issue with the
per bio_set rescuer is this bug (which manifests in dm-snapshot
deadlocks): https://bugzilla.kernel.org/show_bug.cgi?id=119841

Please also see this patch header, from a private branch from a while
ago, that describes the problem in detail:
http://git.kernel.org/cgit/linux/kernel/git/snitzer/linux.git/commit/?h=wip&id=cd2c760b5a609e2aaf3735a7b9503a953535c368

Would welcome your consideration of that BZ as you think further and/or
iterate on this line of work.

Mike

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-07  2:49               ` Kent Overstreet
  2017-02-07 17:13                 ` Mike Snitzer
@ 2017-02-07 20:39                 ` Pavel Machek
  2017-02-08  3:12                   ` Mike Galbraith
  2017-02-08  4:58                   ` Kent Overstreet
  2017-02-08  2:47                 ` Ming Lei
  2 siblings, 2 replies; 33+ messages in thread
From: Pavel Machek @ 2017-02-07 20:39 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mike Snitzer, kernel list, axboe, hch, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

[-- Attachment #1: Type: text/plain, Size: 13883 bytes --]

On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > 
> > > So.. what needs to be done there?
> 
> > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > I'll try and come up with a patch...
> 
> Ok, here's such a patch, only lightly tested:

I guess it would be nice for me to test it... but what is it against?
I tried it on top of v4.10-rc5 and linux-next, but got rejects in both cases.

Thanks,
								Pavel

> -- >8 --
> Subject: [PATCH] block: Make rescuer threads per request_queue, not per bioset
> 
> Note: this patch is very lightly tested.
> 
> Also, trigger rescuing whenever we block with bios on current->bio_list,
> instead of only when we block in bio_alloc_bioset(). This is more correct,
> and should result in fewer rescuer threads.
> 
> XXX: The current->bio_list plugging needs to be unified with the
> blk_plug mechanism.
> 
> TODO: If we change normal request_queue drivers to handle arbitrary size
> bios by processing requests incrementally, instead of splitting bios,
> then we can get rid of rescuer threads from those devices.
> ---
>  block/bio.c            | 107 ++++---------------------------------------------
>  block/blk-core.c       |  58 ++++++++++++++++++++++++---
>  block/blk-sysfs.c      |   2 +
>  include/linux/bio.h    |  16 ++++----
>  include/linux/blkdev.h |  10 +++++
>  include/linux/sched.h  |   2 +-
>  kernel/sched/core.c    |   4 ++
>  7 files changed, 83 insertions(+), 116 deletions(-)
> 
> diff --git a/block/bio.c b/block/bio.c
> index f3b5786202..9ad54a9b12 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -336,54 +336,6 @@ void bio_chain(struct bio *bio, struct bio *parent)
>  }
>  EXPORT_SYMBOL(bio_chain);
>  
> -static void bio_alloc_rescue(struct work_struct *work)
> -{
> -	struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
> -	struct bio *bio;
> -
> -	while (1) {
> -		spin_lock(&bs->rescue_lock);
> -		bio = bio_list_pop(&bs->rescue_list);
> -		spin_unlock(&bs->rescue_lock);
> -
> -		if (!bio)
> -			break;
> -
> -		generic_make_request(bio);
> -	}
> -}
> -
> -static void punt_bios_to_rescuer(struct bio_set *bs)
> -{
> -	struct bio_list punt, nopunt;
> -	struct bio *bio;
> -
> -	/*
> -	 * In order to guarantee forward progress we must punt only bios that
> -	 * were allocated from this bio_set; otherwise, if there was a bio on
> -	 * there for a stacking driver higher up in the stack, processing it
> -	 * could require allocating bios from this bio_set, and doing that from
> -	 * our own rescuer would be bad.
> -	 *
> -	 * Since bio lists are singly linked, pop them all instead of trying to
> -	 * remove from the middle of the list:
> -	 */
> -
> -	bio_list_init(&punt);
> -	bio_list_init(&nopunt);
> -
> -	while ((bio = bio_list_pop(current->bio_list)))
> -		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> -
> -	*current->bio_list = nopunt;
> -
> -	spin_lock(&bs->rescue_lock);
> -	bio_list_merge(&bs->rescue_list, &punt);
> -	spin_unlock(&bs->rescue_lock);
> -
> -	queue_work(bs->rescue_workqueue, &bs->rescue_work);
> -}
> -
>  /**
>   * bio_alloc_bioset - allocate a bio for I/O
>   * @gfp_mask:   the GFP_ mask given to the slab allocator
> @@ -421,54 +373,27 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   */
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> -	gfp_t saved_gfp = gfp_mask;
>  	unsigned front_pad;
>  	unsigned inline_vecs;
>  	struct bio_vec *bvl = NULL;
>  	struct bio *bio;
>  	void *p;
>  
> -	if (!bs) {
> -		if (nr_iovecs > UIO_MAXIOV)
> -			return NULL;
> +	WARN(current->bio_list &&
> +	     !current->bio_list->q->rescue_workqueue,
> +	     "allocating bio beneath generic_make_request() without rescuer");
>  
> +	if (nr_iovecs > UIO_MAXIOV)
> +		return NULL;
> +
> +	if (!bs) {
>  		p = kmalloc(sizeof(struct bio) +
>  			    nr_iovecs * sizeof(struct bio_vec),
>  			    gfp_mask);
>  		front_pad = 0;
>  		inline_vecs = nr_iovecs;
>  	} else {
> -		/*
> -		 * generic_make_request() converts recursion to iteration; this
> -		 * means if we're running beneath it, any bios we allocate and
> -		 * submit will not be submitted (and thus freed) until after we
> -		 * return.
> -		 *
> -		 * This exposes us to a potential deadlock if we allocate
> -		 * multiple bios from the same bio_set() while running
> -		 * underneath generic_make_request(). If we were to allocate
> -		 * multiple bios (say a stacking block driver that was splitting
> -		 * bios), we would deadlock if we exhausted the mempool's
> -		 * reserve.
> -		 *
> -		 * We solve this, and guarantee forward progress, with a rescuer
> -		 * workqueue per bio_set. If we go to allocate and there are
> -		 * bios on current->bio_list, we first try the allocation
> -		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> -		 * bios we would be blocking to the rescuer workqueue before
> -		 * we retry with the original gfp_flags.
> -		 */
> -
> -		if (current->bio_list && !bio_list_empty(current->bio_list))
> -			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> -
>  		p = mempool_alloc(&bs->bio_pool, gfp_mask);
> -		if (!p && gfp_mask != saved_gfp) {
> -			punt_bios_to_rescuer(bs);
> -			gfp_mask = saved_gfp;
> -			p = mempool_alloc(&bs->bio_pool, gfp_mask);
> -		}
> -
>  		front_pad = bs->front_pad;
>  		inline_vecs = BIO_INLINE_VECS;
>  	}
> @@ -483,12 +408,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		unsigned long idx = 0;
>  
>  		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, &bs->bvec_pool);
> -		if (!bvl && gfp_mask != saved_gfp) {
> -			punt_bios_to_rescuer(bs);
> -			gfp_mask = saved_gfp;
> -			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, &bs->bvec_pool);
> -		}
> -
>  		if (unlikely(!bvl))
>  			goto err_free;
>  
> @@ -1938,10 +1857,6 @@ int biovec_init_pool(mempool_t *pool, int pool_entries)
>  
>  void bioset_exit(struct bio_set *bs)
>  {
> -	if (bs->rescue_workqueue)
> -		destroy_workqueue(bs->rescue_workqueue);
> -	bs->rescue_workqueue = NULL;
> -
>  	mempool_exit(&bs->bio_pool);
>  	mempool_exit(&bs->bvec_pool);
>  
> @@ -1968,10 +1883,6 @@ static int __bioset_init(struct bio_set *bs,
>  
>  	bs->front_pad = front_pad;
>  
> -	spin_lock_init(&bs->rescue_lock);
> -	bio_list_init(&bs->rescue_list);
> -	INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
> -
>  	bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
>  	if (!bs->bio_slab)
>  		return -ENOMEM;
> @@ -1983,10 +1894,6 @@ static int __bioset_init(struct bio_set *bs,
>  	    biovec_init_pool(&bs->bvec_pool, pool_size))
>  		goto bad;
>  
> -	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
> -	if (!bs->rescue_workqueue)
> -		goto bad;
> -
>  	return 0;
>  bad:
>  	bioset_exit(bs);
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7e3cfa9c88..f716164cb3 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -48,6 +48,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
>  
>  DEFINE_IDA(blk_queue_ida);
>  
> +static void bio_rescue_work(struct work_struct *);
> +
>  /*
>   * For the allocated request tables
>   */
> @@ -759,11 +761,21 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>  				PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
>  		goto fail_bdi;
>  
> -	if (blkcg_init_queue(q))
> +	spin_lock_init(&q->rescue_lock);
> +	bio_list_init(&q->rescue_list);
> +	INIT_WORK(&q->rescue_work, bio_rescue_work);
> +
> +	q->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
> +	if (!q->rescue_workqueue)
>  		goto fail_ref;
>  
> +	if (blkcg_init_queue(q))
> +		goto fail_rescue;
> +
>  	return q;
>  
> +fail_rescue:
> +	destroy_workqueue(q->rescue_workqueue);
>  fail_ref:
>  	percpu_ref_exit(&q->q_usage_counter);
>  fail_bdi:
> @@ -1994,7 +2006,7 @@ generic_make_request_checks(struct bio *bio)
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -	struct bio_list bio_list_on_stack;
> +	struct bio_plug_list bio_list_on_stack;
>  	blk_qc_t ret = BLK_QC_T_NONE;
>  
>  	if (!generic_make_request_checks(bio))
> @@ -2011,7 +2023,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	 * should be added at the tail
>  	 */
>  	if (current->bio_list) {
> -		bio_list_add(current->bio_list, bio);
> +		WARN(!current->bio_list->q->rescue_workqueue,
> +		     "submitting bio beneath generic_make_request() without rescuer");
> +		bio_list_add(&current->bio_list->bios, bio);
>  		goto out;
>  	}
>  
> @@ -2030,19 +2044,23 @@ blk_qc_t generic_make_request(struct bio *bio)
>  	 * bio_list, and call into ->make_request() again.
>  	 */
>  	BUG_ON(bio->bi_next);
> -	bio_list_init(&bio_list_on_stack);
> +	bio_list_init(&bio_list_on_stack.bios);
>  	current->bio_list = &bio_list_on_stack;
> +
>  	do {
>  		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>  
> +		current->bio_list->q = q;
> +
>  		if (likely(blk_queue_enter(q, false) == 0)) {
>  			ret = q->make_request_fn(q, bio);
>  
>  			blk_queue_exit(q);
>  
> -			bio = bio_list_pop(current->bio_list);
> +			bio = bio_list_pop(&current->bio_list->bios);
>  		} else {
> -			struct bio *bio_next = bio_list_pop(current->bio_list);
> +			struct bio *bio_next =
> +				bio_list_pop(&current->bio_list->bios);
>  
>  			bio_io_error(bio);
>  			bio = bio_next;
> @@ -2055,6 +2073,34 @@ blk_qc_t generic_make_request(struct bio *bio)
>  }
>  EXPORT_SYMBOL(generic_make_request);
>  
> +static void bio_rescue_work(struct work_struct *work)
> +{
> +	struct request_queue *q =
> +		container_of(work, struct request_queue, rescue_work);
> +	struct bio *bio;
> +
> +	while (1) {
> +		spin_lock(&q->rescue_lock);
> +		bio = bio_list_pop(&q->rescue_list);
> +		spin_unlock(&q->rescue_lock);
> +
> +		if (!bio)
> +			break;
> +
> +		generic_make_request(bio);
> +	}
> +}
> +
> +void blk_punt_blocked_bios(struct bio_plug_list *list)
> +{
> +	spin_lock(&list->q->rescue_lock);
> +	bio_list_merge(&list->q->rescue_list, &list->bios);
> +	bio_list_init(&list->bios);
> +	spin_unlock(&list->q->rescue_lock);
> +
> +	queue_work(list->q->rescue_workqueue, &list->q->rescue_work);
> +}
> +
>  /**
>   * submit_bio - submit a bio to the block device layer for I/O
>   * @bio: The &struct bio which describes the I/O
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 7f27a18cc4..77529238d1 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -660,6 +660,8 @@ static void blk_release_queue(struct kobject *kobj)
>  
>  	blk_trace_shutdown(q);
>  
> +	if (q->rescue_workqueue)
> +		destroy_workqueue(q->rescue_workqueue);
>  	if (q->bio_split)
>  		bioset_free(q->bio_split);
>  
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1ffe8e37ae..87eeec7eda 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -658,6 +658,13 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
>  	return bio;
>  }
>  
> +struct bio_plug_list {
> +	struct bio_list		bios;
> +	struct request_queue	*q;
> +};
> +
> +void blk_punt_blocked_bios(struct bio_plug_list *);
> +
>  /*
>   * Increment chain count for the bio. Make sure the CHAIN flag update
>   * is visible before the raised count.
> @@ -687,15 +694,6 @@ struct bio_set {
>  	mempool_t bio_integrity_pool;
>  	mempool_t bvec_integrity_pool;
>  #endif
> -
> -	/*
> -	 * Deadlock avoidance for stacking block drivers: see comments in
> -	 * bio_alloc_bioset() for details
> -	 */
> -	spinlock_t		rescue_lock;
> -	struct bio_list		rescue_list;
> -	struct work_struct	rescue_work;
> -	struct workqueue_struct	*rescue_workqueue;
>  };
>  
>  struct biovec_slab {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c47c358ba0..f64b886c65 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -476,6 +476,16 @@ struct request_queue {
>  	struct bio_set		*bio_split;
>  
>  	bool			mq_sysfs_init_done;
> +
> +	/*
> +	 * Deadlock avoidance, to deal with the plugging in
> +	 * generic_make_request() that converts recursion to iteration to avoid
> +	 * stack overflow:
> +	 */
> +	spinlock_t		rescue_lock;
> +	struct bio_list		rescue_list;
> +	struct work_struct	rescue_work;
> +	struct workqueue_struct	*rescue_workqueue;
>  };
>  
>  #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2865d10a28..59df7a1030 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1762,7 +1762,7 @@ struct task_struct {
>  	void *journal_info;
>  
>  /* stacked block device info */
> -	struct bio_list *bio_list;
> +	struct bio_plug_list *bio_list;
>  
>  #ifdef CONFIG_BLOCK
>  /* stack plugging */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bd39d698cb..23b6290ba1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3439,6 +3439,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>  {
>  	if (!tsk->state || tsk_is_pi_blocked(tsk))
>  		return;
> +
> +	if (tsk->bio_list && !bio_list_empty(&tsk->bio_list->bios))
> +		blk_punt_blocked_bios(tsk->bio_list);
> +
>  	/*
>  	 * If we are going to sleep and we have plugged IO queued,
>  	 * make sure to submit it to avoid deadlocks.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-07  2:49               ` Kent Overstreet
  2017-02-07 17:13                 ` Mike Snitzer
  2017-02-07 20:39                 ` Pavel Machek
@ 2017-02-08  2:47                 ` Ming Lei
  2 siblings, 0 replies; 33+ messages in thread
From: Ming Lei @ 2017-02-08  2:47 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Pavel Machek, Mike Snitzer, kernel list, Jens Axboe,
	Christoph Hellwig, NeilBrown, Martin K. Petersen, Dongsu Park,
	Ming Lin, open list:DEVICE-MAPPER (LVM),
	Alasdair Kergon, Jiri Kosina, Geoff Levand, Jim Paris,
	Philip Kelleher, Minchan Kim, Nitin Gupta, Oleg Drokin,
	Andreas Dilger

On Tue, Feb 7, 2017 at 10:49 AM, Kent Overstreet
<kent.overstreet@gmail.com> wrote:
> On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
>> On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
>> > Still there on v4.9, 36 threads on nokia n900 cellphone.
>> >
>> > So.. what needs to be done there?
>
>> But, I just got an idea for how to handle this that might be halfway sane, maybe
>> I'll try and come up with a patch...
>
> Ok, here's such a patch, only lightly tested:
>
> -- >8 --
> Subject: [PATCH] block: Make rescuer threads per request_queue, not per bioset
>
> Note: this patch is very lightly tested.
>
> Also, trigger rescuing whenever we block with bios on current->bio_list,
> instead of only when we block in bio_alloc_bioset(). This is more correct,
> and should result in fewer rescuer threads.

Looks like the rescuer stuff gets simplified a lot with this patch.

>
> XXX: The current->bio_list plugging needs to be unified with the
> blk_plug mechanism.

Yeah, that can be another benefit, :-)

>
> TODO: If we change normal request_queue drivers to handle arbitrary size
> bios by processing requests incrementally, instead of splitting bios,
> then we can get rid of rescuer threads from those devices.

Also, the rescuer threads often come from reserved block devices such as
loop/nbd; we should allow these drivers to delay allocating the thread
until just before the disk is activated. Then the thread count could be
decreased a lot.
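
A minimal sketch of that idea (illustrative only; blk_queue_alloc_rescuer()
is a made-up helper, not an existing interface) would defer the workqueue
allocation until a driver actually brings the disk up:

	/*
	 * Hypothetical helper: create the rescuer lazily, e.g. right before
	 * add_disk(), instead of unconditionally in blk_alloc_queue_node().
	 */
	static int blk_queue_alloc_rescuer(struct request_queue *q)
	{
		struct workqueue_struct *wq;

		if (q->rescue_workqueue)
			return 0;

		wq = alloc_workqueue("rescue", WQ_MEM_RECLAIM, 0);
		if (!wq)
			return -ENOMEM;

		/* keep only one workqueue if two callers race */
		if (cmpxchg(&q->rescue_workqueue, NULL, wq))
			destroy_workqueue(wq);

		return 0;
	}

Drivers like loop or nbd would call it from the path that activates the
device, so devices that are never attached would not spawn a rescuer thread.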

> ---
>  block/bio.c            | 107 ++++---------------------------------------------
>  block/blk-core.c       |  58 ++++++++++++++++++++++++---
>  block/blk-sysfs.c      |   2 +
>  include/linux/bio.h    |  16 ++++----
>  include/linux/blkdev.h |  10 +++++
>  include/linux/sched.h  |   2 +-
>  kernel/sched/core.c    |   4 ++
>  7 files changed, 83 insertions(+), 116 deletions(-)
>
> diff --git a/block/bio.c b/block/bio.c
> index f3b5786202..9ad54a9b12 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -336,54 +336,6 @@ void bio_chain(struct bio *bio, struct bio *parent)
>  }
>  EXPORT_SYMBOL(bio_chain);
>
> -static void bio_alloc_rescue(struct work_struct *work)
> -{
> -       struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
> -       struct bio *bio;
> -
> -       while (1) {
> -               spin_lock(&bs->rescue_lock);
> -               bio = bio_list_pop(&bs->rescue_list);
> -               spin_unlock(&bs->rescue_lock);
> -
> -               if (!bio)
> -                       break;
> -
> -               generic_make_request(bio);
> -       }
> -}
> -
> -static void punt_bios_to_rescuer(struct bio_set *bs)
> -{
> -       struct bio_list punt, nopunt;
> -       struct bio *bio;
> -
> -       /*
> -        * In order to guarantee forward progress we must punt only bios that
> -        * were allocated from this bio_set; otherwise, if there was a bio on
> -        * there for a stacking driver higher up in the stack, processing it
> -        * could require allocating bios from this bio_set, and doing that from
> -        * our own rescuer would be bad.
> -        *
> -        * Since bio lists are singly linked, pop them all instead of trying to
> -        * remove from the middle of the list:
> -        */
> -
> -       bio_list_init(&punt);
> -       bio_list_init(&nopunt);
> -
> -       while ((bio = bio_list_pop(current->bio_list)))
> -               bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
> -
> -       *current->bio_list = nopunt;
> -
> -       spin_lock(&bs->rescue_lock);
> -       bio_list_merge(&bs->rescue_list, &punt);
> -       spin_unlock(&bs->rescue_lock);
> -
> -       queue_work(bs->rescue_workqueue, &bs->rescue_work);
> -}
> -
>  /**
>   * bio_alloc_bioset - allocate a bio for I/O
>   * @gfp_mask:   the GFP_ mask given to the slab allocator
> @@ -421,54 +373,27 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   */
>  struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  {
> -       gfp_t saved_gfp = gfp_mask;
>         unsigned front_pad;
>         unsigned inline_vecs;
>         struct bio_vec *bvl = NULL;
>         struct bio *bio;
>         void *p;
>
> -       if (!bs) {
> -               if (nr_iovecs > UIO_MAXIOV)
> -                       return NULL;
> +       WARN(current->bio_list &&
> +            !current->bio_list->q->rescue_workqueue,
> +            "allocating bio beneath generic_make_request() without rescuer");
>
> +       if (nr_iovecs > UIO_MAXIOV)
> +               return NULL;
> +
> +       if (!bs) {
>                 p = kmalloc(sizeof(struct bio) +
>                             nr_iovecs * sizeof(struct bio_vec),
>                             gfp_mask);
>                 front_pad = 0;
>                 inline_vecs = nr_iovecs;
>         } else {
> -               /*
> -                * generic_make_request() converts recursion to iteration; this
> -                * means if we're running beneath it, any bios we allocate and
> -                * submit will not be submitted (and thus freed) until after we
> -                * return.
> -                *
> -                * This exposes us to a potential deadlock if we allocate
> -                * multiple bios from the same bio_set() while running
> -                * underneath generic_make_request(). If we were to allocate
> -                * multiple bios (say a stacking block driver that was splitting
> -                * bios), we would deadlock if we exhausted the mempool's
> -                * reserve.
> -                *
> -                * We solve this, and guarantee forward progress, with a rescuer
> -                * workqueue per bio_set. If we go to allocate and there are
> -                * bios on current->bio_list, we first try the allocation
> -                * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> -                * bios we would be blocking to the rescuer workqueue before
> -                * we retry with the original gfp_flags.
> -                */
> -
> -               if (current->bio_list && !bio_list_empty(current->bio_list))
> -                       gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> -
>                 p = mempool_alloc(&bs->bio_pool, gfp_mask);
> -               if (!p && gfp_mask != saved_gfp) {
> -                       punt_bios_to_rescuer(bs);
> -                       gfp_mask = saved_gfp;
> -                       p = mempool_alloc(&bs->bio_pool, gfp_mask);
> -               }
> -
>                 front_pad = bs->front_pad;
>                 inline_vecs = BIO_INLINE_VECS;
>         }
> @@ -483,12 +408,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>                 unsigned long idx = 0;
>
>                 bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, &bs->bvec_pool);
> -               if (!bvl && gfp_mask != saved_gfp) {
> -                       punt_bios_to_rescuer(bs);
> -                       gfp_mask = saved_gfp;
> -                       bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, &bs->bvec_pool);
> -               }
> -
>                 if (unlikely(!bvl))
>                         goto err_free;
>
> @@ -1938,10 +1857,6 @@ int biovec_init_pool(mempool_t *pool, int pool_entries)
>
>  void bioset_exit(struct bio_set *bs)
>  {
> -       if (bs->rescue_workqueue)
> -               destroy_workqueue(bs->rescue_workqueue);
> -       bs->rescue_workqueue = NULL;
> -
>         mempool_exit(&bs->bio_pool);
>         mempool_exit(&bs->bvec_pool);
>
> @@ -1968,10 +1883,6 @@ static int __bioset_init(struct bio_set *bs,
>
>         bs->front_pad = front_pad;
>
> -       spin_lock_init(&bs->rescue_lock);
> -       bio_list_init(&bs->rescue_list);
> -       INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
> -
>         bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
>         if (!bs->bio_slab)
>                 return -ENOMEM;
> @@ -1983,10 +1894,6 @@ static int __bioset_init(struct bio_set *bs,
>             biovec_init_pool(&bs->bvec_pool, pool_size))
>                 goto bad;
>
> -       bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
> -       if (!bs->rescue_workqueue)
> -               goto bad;
> -
>         return 0;
>  bad:
>         bioset_exit(bs);
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 7e3cfa9c88..f716164cb3 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -48,6 +48,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
>
>  DEFINE_IDA(blk_queue_ida);
>
> +static void bio_rescue_work(struct work_struct *);
> +
>  /*
>   * For the allocated request tables
>   */
> @@ -759,11 +761,21 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>                                 PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
>                 goto fail_bdi;
>
> -       if (blkcg_init_queue(q))
> +       spin_lock_init(&q->rescue_lock);
> +       bio_list_init(&q->rescue_list);
> +       INIT_WORK(&q->rescue_work, bio_rescue_work);
> +
> +       q->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
> +       if (!q->rescue_workqueue)
>                 goto fail_ref;
>
> +       if (blkcg_init_queue(q))
> +               goto fail_rescue;
> +
>         return q;
>
> +fail_rescue:
> +       destroy_workqueue(q->rescue_workqueue);
>  fail_ref:
>         percpu_ref_exit(&q->q_usage_counter);
>  fail_bdi:
> @@ -1994,7 +2006,7 @@ generic_make_request_checks(struct bio *bio)
>   */
>  blk_qc_t generic_make_request(struct bio *bio)
>  {
> -       struct bio_list bio_list_on_stack;
> +       struct bio_plug_list bio_list_on_stack;
>         blk_qc_t ret = BLK_QC_T_NONE;
>
>         if (!generic_make_request_checks(bio))
> @@ -2011,7 +2023,9 @@ blk_qc_t generic_make_request(struct bio *bio)
>          * should be added at the tail
>          */
>         if (current->bio_list) {
> -               bio_list_add(current->bio_list, bio);
> +               WARN(!current->bio_list->q->rescue_workqueue,
> +                    "submitting bio beneath generic_make_request() without rescuer");
> +               bio_list_add(&current->bio_list->bios, bio);
>                 goto out;
>         }
>
> @@ -2030,19 +2044,23 @@ blk_qc_t generic_make_request(struct bio *bio)
>          * bio_list, and call into ->make_request() again.
>          */
>         BUG_ON(bio->bi_next);
> -       bio_list_init(&bio_list_on_stack);
> +       bio_list_init(&bio_list_on_stack.bios);
>         current->bio_list = &bio_list_on_stack;
> +
>         do {
>                 struct request_queue *q = bdev_get_queue(bio->bi_bdev);
>
> +               current->bio_list->q = q;
> +
>                 if (likely(blk_queue_enter(q, false) == 0)) {
>                         ret = q->make_request_fn(q, bio);
>
>                         blk_queue_exit(q);
>
> -                       bio = bio_list_pop(current->bio_list);
> +                       bio = bio_list_pop(&current->bio_list->bios);
>                 } else {
> -                       struct bio *bio_next = bio_list_pop(current->bio_list);
> +                       struct bio *bio_next =
> +                               bio_list_pop(&current->bio_list->bios);
>
>                         bio_io_error(bio);
>                         bio = bio_next;
> @@ -2055,6 +2073,34 @@ blk_qc_t generic_make_request(struct bio *bio)
>  }
>  EXPORT_SYMBOL(generic_make_request);
>
> +static void bio_rescue_work(struct work_struct *work)
> +{
> +       struct request_queue *q =
> +               container_of(work, struct request_queue, rescue_work);
> +       struct bio *bio;
> +
> +       while (1) {
> +               spin_lock(&q->rescue_lock);
> +               bio = bio_list_pop(&q->rescue_list);
> +               spin_unlock(&q->rescue_lock);
> +
> +               if (!bio)
> +                       break;
> +
> +               generic_make_request(bio);
> +       }
> +}
> +
> +void blk_punt_blocked_bios(struct bio_plug_list *list)
> +{
> +       spin_lock(&list->q->rescue_lock);
> +       bio_list_merge(&list->q->rescue_list, &list->bios);
> +       bio_list_init(&list->bios);
> +       spin_unlock(&list->q->rescue_lock);
> +
> +       queue_work(list->q->rescue_workqueue, &list->q->rescue_work);
> +}

I guess we may need to move each bio onto its own queue's rescue list;
otherwise it can still deadlock in a situation like the following.

Take md as an example:

generic_make_request(bio_m0, md_q)
    ->.make_request(md_q)
        ->bio_l = mempool_alloc(md_pool)
            ->bio_l->q = ll_q
            ->generic_make_request(bio_l, ll_q)

    ....
    mempool_alloc() reclaim is triggered when allocating a new bio for the
    low-level device:
        -> suppose the current bio_list holds bio_m1, bio_l0 and bio_l1
        -> all of them are queued onto md_q's rescue_list and handled in
           md_q's rescuer workqueue context
        -> bio_m1 is handled first, and mempool_alloc() reclaim is triggered
           again
        -> when bio_l0 is handled in generic_make_request(), the rescue_list
           is switched to ll_q's (which can be empty), and nothing can make
           progress any more
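
One way to do that, sketched only on top of this patch (assuming
bio->bi_bdev is valid for every bio held on the list, and reusing the
rescue_* fields the patch adds to struct request_queue), is to punt each
bio to the rescuer of the queue it is destined for rather than the single
queue recorded in the plug list:

	void blk_punt_blocked_bios(struct bio_plug_list *list)
	{
		struct bio *bio;

		while ((bio = bio_list_pop(&list->bios))) {
			/* route the bio to its own queue's rescuer */
			struct request_queue *q = bdev_get_queue(bio->bi_bdev);

			spin_lock(&q->rescue_lock);
			bio_list_add(&q->rescue_list, bio);
			spin_unlock(&q->rescue_lock);

			queue_work(q->rescue_workqueue, &q->rescue_work);
		}
	}

That way a bio destined for ll_q lands on ll_q's rescue_list even if it was
queued while md_q was the current plug queue.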

> +
>  /**
>   * submit_bio - submit a bio to the block device layer for I/O
>   * @bio: The &struct bio which describes the I/O
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 7f27a18cc4..77529238d1 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -660,6 +660,8 @@ static void blk_release_queue(struct kobject *kobj)
>
>         blk_trace_shutdown(q);
>
> +       if (q->rescue_workqueue)
> +               destroy_workqueue(q->rescue_workqueue);
>         if (q->bio_split)
>                 bioset_free(q->bio_split);
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 1ffe8e37ae..87eeec7eda 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -658,6 +658,13 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
>         return bio;
>  }
>
> +struct bio_plug_list {
> +       struct bio_list         bios;
> +       struct request_queue    *q;
> +};
> +
> +void blk_punt_blocked_bios(struct bio_plug_list *);
> +
>  /*
>   * Increment chain count for the bio. Make sure the CHAIN flag update
>   * is visible before the raised count.
> @@ -687,15 +694,6 @@ struct bio_set {
>         mempool_t bio_integrity_pool;
>         mempool_t bvec_integrity_pool;
>  #endif
> -
> -       /*
> -        * Deadlock avoidance for stacking block drivers: see comments in
> -        * bio_alloc_bioset() for details
> -        */
> -       spinlock_t              rescue_lock;
> -       struct bio_list         rescue_list;
> -       struct work_struct      rescue_work;
> -       struct workqueue_struct *rescue_workqueue;
>  };
>
>  struct biovec_slab {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index c47c358ba0..f64b886c65 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -476,6 +476,16 @@ struct request_queue {
>         struct bio_set          *bio_split;
>
>         bool                    mq_sysfs_init_done;
> +
> +       /*
> +        * Deadlock avoidance, to deal with the plugging in
> +        * generic_make_request() that converts recursion to iteration to avoid
> +        * stack overflow:
> +        */
> +       spinlock_t              rescue_lock;
> +       struct bio_list         rescue_list;
> +       struct work_struct      rescue_work;
> +       struct workqueue_struct *rescue_workqueue;
>  };
>
>  #define QUEUE_FLAG_QUEUED      1       /* uses generic tag queueing */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 2865d10a28..59df7a1030 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1762,7 +1762,7 @@ struct task_struct {
>         void *journal_info;
>
>  /* stacked block device info */
> -       struct bio_list *bio_list;
> +       struct bio_plug_list *bio_list;
>
>  #ifdef CONFIG_BLOCK
>  /* stack plugging */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bd39d698cb..23b6290ba1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3439,6 +3439,10 @@ static inline void sched_submit_work(struct task_struct *tsk)
>  {
>         if (!tsk->state || tsk_is_pi_blocked(tsk))
>                 return;
> +
> +       if (tsk->bio_list && !bio_list_empty(&tsk->bio_list->bios))
> +               blk_punt_blocked_bios(tsk->bio_list);
> +
>         /*
>          * If we are going to sleep and we have plugged IO queued,
>          * make sure to submit it to avoid deadlocks.
> --
> 2.11.0
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-07 20:39                 ` Pavel Machek
@ 2017-02-08  3:12                   ` Mike Galbraith
  2017-02-08  4:58                   ` Kent Overstreet
  1 sibling, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2017-02-08  3:12 UTC (permalink / raw)
  To: Pavel Machek, Kent Overstreet
  Cc: Mike Snitzer, kernel list, axboe, hch, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

On Tue, 2017-02-07 at 21:39 +0100, Pavel Machek wrote:
> On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > 
> > > > So.. what needs to be done there?
> > 
> > > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > > I'll try and come up with a patch...
> > 
> > Ok, here's such a patch, only lightly tested:
> 
> I guess it would be nice for me to test it... but what is it against?
> I tried it on top of v4.10-rc5 and linux-next, but got rejects in both cases.

It wedged into master easily enough (box still seems to work.. but I'll
be rebooting in a very few seconds just in case:), but threads on my
desktop box only dropped from 73 to 71.  Poo.

	-Mike

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-07 20:39                 ` Pavel Machek
  2017-02-08  3:12                   ` Mike Galbraith
@ 2017-02-08  4:58                   ` Kent Overstreet
  2017-02-08  6:22                     ` [PATCH] block: Make rescuer threads per request_queue, not per bioset kbuild test robot
                                       ` (3 more replies)
  1 sibling, 4 replies; 33+ messages in thread
From: Kent Overstreet @ 2017-02-08  4:58 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Mike Snitzer, kernel list, axboe, hch, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

On Tue, Feb 07, 2017 at 09:39:11PM +0100, Pavel Machek wrote:
> On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > 
> > > > So.. what needs to be done there?
> > 
> > > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > > I'll try and come up with a patch...
> > 
> > Ok, here's such a patch, only lightly tested:
> 
> I guess it would be nice for me to test it... but what is it against?
> I tried it on top of v4.10-rc5 and linux-next, but got rejects in both cases.

Sorry, I forgot I had a few other patches in my branch that touch
mempool/biosets code.

Also, after thinking about it more and looking at the relevant code, I'm pretty
sure we don't need rescuer threads for block devices that just split bios - i.e.
most of them, so I changed my patch to do that.

Tested it by ripping out the current->bio_list checks/workarounds from the
bcache code, appears to work:

-- >8 --
Subject: [PATCH] block: Make rescuer threads per request_queue, not per bioset

Also, trigger rescuing whenever we block with bios on current->bio_list,
instead of only when we block in bio_alloc_bioset(). This is more correct,
and should result in fewer rescuer threads.

XXX: The current->bio_list plugging needs to be unified with the
blk_plug mechanism.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 block/bio.c                    | 105 +++--------------------------------------
 block/blk-core.c               |  69 +++++++++++++++++++++++----
 block/blk-mq.c                 |   3 +-
 block/blk-sysfs.c              |   2 +
 drivers/block/brd.c            |   2 +-
 drivers/block/drbd/drbd_main.c |   2 +-
 drivers/block/null_blk.c       |   3 +-
 drivers/block/pktcdvd.c        |   2 +-
 drivers/block/ps3vram.c        |   2 +-
 drivers/block/rsxx/dev.c       |   2 +-
 drivers/block/umem.c           |   2 +-
 drivers/block/zram/zram_drv.c  |   2 +-
 drivers/lightnvm/gennvm.c      |   2 +-
 drivers/md/bcache/super.c      |   2 +-
 drivers/md/dm.c                |   2 +-
 drivers/md/md.c                |   2 +-
 drivers/nvdimm/blk.c           |   2 +-
 drivers/nvdimm/btt.c           |   2 +-
 drivers/nvdimm/pmem.c          |   3 +-
 drivers/s390/block/dcssblk.c   |   2 +-
 drivers/s390/block/xpram.c     |   2 +-
 include/linux/bio.h            |  16 +++----
 include/linux/blkdev.h         |  16 ++++++-
 include/linux/sched.h          |   2 +-
 kernel/sched/core.c            |   6 +++
 25 files changed, 117 insertions(+), 138 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 2b375020fc..9b89be1719 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -340,54 +340,6 @@ void bio_chain(struct bio *bio, struct bio *parent)
 }
 EXPORT_SYMBOL(bio_chain);
 
-static void bio_alloc_rescue(struct work_struct *work)
-{
-	struct bio_set *bs = container_of(work, struct bio_set, rescue_work);
-	struct bio *bio;
-
-	while (1) {
-		spin_lock(&bs->rescue_lock);
-		bio = bio_list_pop(&bs->rescue_list);
-		spin_unlock(&bs->rescue_lock);
-
-		if (!bio)
-			break;
-
-		generic_make_request(bio);
-	}
-}
-
-static void punt_bios_to_rescuer(struct bio_set *bs)
-{
-	struct bio_list punt, nopunt;
-	struct bio *bio;
-
-	/*
-	 * In order to guarantee forward progress we must punt only bios that
-	 * were allocated from this bio_set; otherwise, if there was a bio on
-	 * there for a stacking driver higher up in the stack, processing it
-	 * could require allocating bios from this bio_set, and doing that from
-	 * our own rescuer would be bad.
-	 *
-	 * Since bio lists are singly linked, pop them all instead of trying to
-	 * remove from the middle of the list:
-	 */
-
-	bio_list_init(&punt);
-	bio_list_init(&nopunt);
-
-	while ((bio = bio_list_pop(current->bio_list)))
-		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
-	*current->bio_list = nopunt;
-
-	spin_lock(&bs->rescue_lock);
-	bio_list_merge(&bs->rescue_list, &punt);
-	spin_unlock(&bs->rescue_lock);
-
-	queue_work(bs->rescue_workqueue, &bs->rescue_work);
-}
-
 /**
  * bio_alloc_bioset - allocate a bio for I/O
  * @gfp_mask:   the GFP_ mask given to the slab allocator
@@ -425,17 +377,20 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
-	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	struct bio_vec *bvl = NULL;
 	struct bio *bio;
 	void *p;
 
-	if (!bs) {
-		if (nr_iovecs > UIO_MAXIOV)
-			return NULL;
+	WARN(current->bio_list &&
+	     !current->bio_list->q->rescue_workqueue,
+	     "allocating bio beneath generic_make_request() without rescuer");
 
+	if (nr_iovecs > UIO_MAXIOV)
+		return NULL;
+
+	if (!bs) {
 		p = kmalloc(sizeof(struct bio) +
 			    nr_iovecs * sizeof(struct bio_vec),
 			    gfp_mask);
@@ -445,37 +400,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		/* should not use nobvec bioset for nr_iovecs > 0 */
 		if (WARN_ON_ONCE(!bs->bvec_pool && nr_iovecs > 0))
 			return NULL;
-		/*
-		 * generic_make_request() converts recursion to iteration; this
-		 * means if we're running beneath it, any bios we allocate and
-		 * submit will not be submitted (and thus freed) until after we
-		 * return.
-		 *
-		 * This exposes us to a potential deadlock if we allocate
-		 * multiple bios from the same bio_set() while running
-		 * underneath generic_make_request(). If we were to allocate
-		 * multiple bios (say a stacking block driver that was splitting
-		 * bios), we would deadlock if we exhausted the mempool's
-		 * reserve.
-		 *
-		 * We solve this, and guarantee forward progress, with a rescuer
-		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
-		 * bios we would be blocking to the rescuer workqueue before
-		 * we retry with the original gfp_flags.
-		 */
-
-		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
-		if (!p && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			p = mempool_alloc(bs->bio_pool, gfp_mask);
-		}
-
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
@@ -490,12 +416,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		unsigned long idx = 0;
 
 		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		if (!bvl && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		}
-
 		if (unlikely(!bvl))
 			goto err_free;
 
@@ -1892,9 +1812,6 @@ mempool_t *biovec_create_pool(int pool_entries)
 
 void bioset_free(struct bio_set *bs)
 {
-	if (bs->rescue_workqueue)
-		destroy_workqueue(bs->rescue_workqueue);
-
 	if (bs->bio_pool)
 		mempool_destroy(bs->bio_pool);
 
@@ -1921,10 +1838,6 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
 
 	bs->front_pad = front_pad;
 
-	spin_lock_init(&bs->rescue_lock);
-	bio_list_init(&bs->rescue_list);
-	INIT_WORK(&bs->rescue_work, bio_alloc_rescue);
-
 	bs->bio_slab = bio_find_or_create_slab(front_pad + back_pad);
 	if (!bs->bio_slab) {
 		kfree(bs);
@@ -1941,10 +1854,6 @@ static struct bio_set *__bioset_create(unsigned int pool_size,
 			goto bad;
 	}
 
-	bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
-	if (!bs->rescue_workqueue)
-		goto bad;
-
 	return bs;
 bad:
 	bioset_free(bs);
diff --git a/block/blk-core.c b/block/blk-core.c
index 61ba08c58b..2222fd40e2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -49,6 +49,8 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
 
 DEFINE_IDA(blk_queue_ida);
 
+static void bio_rescue_work(struct work_struct *);
+
 /*
  * For the allocated request tables
  */
@@ -643,9 +645,9 @@ void blk_exit_rl(struct request_list *rl)
 		mempool_destroy(rl->rq_pool);
 }
 
-struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
+struct request_queue *blk_alloc_queue(gfp_t gfp_mask, int flags)
 {
-	return blk_alloc_queue_node(gfp_mask, NUMA_NO_NODE);
+	return blk_alloc_queue_node(gfp_mask, NUMA_NO_NODE, flags);
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
@@ -690,7 +692,7 @@ static void blk_rq_timed_out_timer(unsigned long data)
 	kblockd_schedule_work(&q->timeout_work);
 }
 
-struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
+struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id, int flags)
 {
 	struct request_queue *q;
 	int err;
@@ -760,11 +762,23 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 				PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto fail_bdi;
 
+	spin_lock_init(&q->rescue_lock);
+	bio_list_init(&q->rescue_list);
+	INIT_WORK(&q->rescue_work, bio_rescue_work);
+
+	if (!(flags & BLK_QUEUE_NO_RESCUER)) {
+		q->rescue_workqueue = alloc_workqueue("rescue", WQ_MEM_RECLAIM, 0);
+		if (!q->rescue_workqueue)
+			goto fail_ref;
+	}
+
 	if (blkcg_init_queue(q))
-		goto fail_ref;
+		goto fail_rescue;
 
 	return q;
 
+fail_rescue:
+	destroy_workqueue(q->rescue_workqueue);
 fail_ref:
 	percpu_ref_exit(&q->q_usage_counter);
 fail_bdi:
@@ -823,7 +837,8 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
 {
 	struct request_queue *uninit_q, *q;
 
-	uninit_q = blk_alloc_queue_node(GFP_KERNEL, node_id);
+	uninit_q = blk_alloc_queue_node(GFP_KERNEL, node_id,
+					BLK_QUEUE_NO_RESCUER);
 	if (!uninit_q)
 		return NULL;
 
@@ -1977,7 +1992,7 @@ generic_make_request_checks(struct bio *bio)
  */
 blk_qc_t generic_make_request(struct bio *bio)
 {
-	struct bio_list bio_list_on_stack;
+	struct bio_plug_list bio_list_on_stack;
 	blk_qc_t ret = BLK_QC_T_NONE;
 
 	if (!generic_make_request_checks(bio))
@@ -1994,7 +2009,9 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * should be added at the tail
 	 */
 	if (current->bio_list) {
-		bio_list_add(current->bio_list, bio);
+		WARN(!current->bio_list->q->rescue_workqueue,
+		     "submitting bio beneath generic_make_request() without rescuer");
+		bio_list_add(&current->bio_list->bios, bio);
 		goto out;
 	}
 
@@ -2013,19 +2030,23 @@ blk_qc_t generic_make_request(struct bio *bio)
 	 * bio_list, and call into ->make_request() again.
 	 */
 	BUG_ON(bio->bi_next);
-	bio_list_init(&bio_list_on_stack);
+	bio_list_init(&bio_list_on_stack.bios);
 	current->bio_list = &bio_list_on_stack;
+
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
+		current->bio_list->q = q;
+
 		if (likely(blk_queue_enter(q, false) == 0)) {
 			ret = q->make_request_fn(q, bio);
 
 			blk_queue_exit(q);
 
-			bio = bio_list_pop(current->bio_list);
+			bio = bio_list_pop(&current->bio_list->bios);
 		} else {
-			struct bio *bio_next = bio_list_pop(current->bio_list);
+			struct bio *bio_next =
+				bio_list_pop(&current->bio_list->bios);
 
 			bio_io_error(bio);
 			bio = bio_next;
@@ -2038,6 +2059,34 @@ blk_qc_t generic_make_request(struct bio *bio)
 }
 EXPORT_SYMBOL(generic_make_request);
 
+static void bio_rescue_work(struct work_struct *work)
+{
+	struct request_queue *q =
+		container_of(work, struct request_queue, rescue_work);
+	struct bio *bio;
+
+	while (1) {
+		spin_lock(&q->rescue_lock);
+		bio = bio_list_pop(&q->rescue_list);
+		spin_unlock(&q->rescue_lock);
+
+		if (!bio)
+			break;
+
+		generic_make_request(bio);
+	}
+}
+
+void blk_punt_blocked_bios(struct bio_plug_list *list)
+{
+	spin_lock(&list->q->rescue_lock);
+	bio_list_merge(&list->q->rescue_list, &list->bios);
+	bio_list_init(&list->bios);
+	spin_unlock(&list->q->rescue_lock);
+
+	queue_work(list->q->rescue_workqueue, &list->q->rescue_work);
+}
+
 /**
  * submit_bio - submit a bio to the block device layer for I/O
  * @bio: The &struct bio which describes the I/O
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c3400b5444..5e7f67c108 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2043,7 +2043,8 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
 {
 	struct request_queue *uninit_q, *q;
 
-	uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node);
+	uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node,
+					BLK_QUEUE_NO_RESCUER);
 	if (!uninit_q)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1dbce05759..1ab82c342b 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -821,6 +821,8 @@ static void blk_release_queue(struct kobject *kobj)
 
 	blk_trace_shutdown(q);
 
+	if (q->rescue_workqueue)
+		destroy_workqueue(q->rescue_workqueue);
 	if (q->bio_split)
 		bioset_free(q->bio_split);
 
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 3adc32a315..43ff4b23e4 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -449,7 +449,7 @@ static struct brd_device *brd_alloc(int i)
 	spin_lock_init(&brd->brd_lock);
 	INIT_RADIX_TREE(&brd->brd_pages, GFP_ATOMIC);
 
-	brd->brd_queue = blk_alloc_queue(GFP_KERNEL);
+	brd->brd_queue = blk_alloc_queue(GFP_KERNEL, BLK_QUEUE_NO_RESCUER);
 	if (!brd->brd_queue)
 		goto out_free_dev;
 
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 83482721bc..e46821ebc6 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2810,7 +2810,7 @@ enum drbd_ret_code drbd_create_device(struct drbd_config_context *adm_ctx, unsig
 
 	drbd_init_set_defaults(device);
 
-	q = blk_alloc_queue(GFP_KERNEL);
+	q = blk_alloc_queue(GFP_KERNEL, 0);
 	if (!q)
 		goto out_no_q;
 	device->rq_queue = q;
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index c0e14e5490..0ce25ce95f 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -734,7 +734,8 @@ static int null_add_dev(void)
 			goto out_cleanup_tags;
 		}
 	} else if (queue_mode == NULL_Q_BIO) {
-		nullb->q = blk_alloc_queue_node(GFP_KERNEL, home_node);
+		nullb->q = blk_alloc_queue_node(GFP_KERNEL, home_node,
+						BLK_QUEUE_NO_RESCUER);
 		if (!nullb->q) {
 			rv = -ENOMEM;
 			goto out_cleanup_queues;
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 1b94c1ca5c..3ab1629475 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2737,7 +2737,7 @@ static int pkt_setup_dev(dev_t dev, dev_t* pkt_dev)
 	strcpy(disk->disk_name, pd->name);
 	disk->devnode = pktcdvd_devnode;
 	disk->private_data = pd;
-	disk->queue = blk_alloc_queue(GFP_KERNEL);
+	disk->queue = blk_alloc_queue(GFP_KERNEL, BLK_QUEUE_NO_RESCUER);
 	if (!disk->queue)
 		goto out_mem2;
 
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 456b4fe215..167e17058c 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -746,7 +746,7 @@ static int ps3vram_probe(struct ps3_system_bus_device *dev)
 	ps3vram_cache_init(dev);
 	ps3vram_proc_init(dev);
 
-	queue = blk_alloc_queue(GFP_KERNEL);
+	queue = blk_alloc_queue(GFP_KERNEL, BLK_QUEUE_NO_RESCUER);
 	if (!queue) {
 		dev_err(&dev->core, "blk_alloc_queue failed\n");
 		error = -ENOMEM;
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index f81d70b39d..e53cea595f 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -266,7 +266,7 @@ int rsxx_setup_dev(struct rsxx_cardinfo *card)
 		return -ENOMEM;
 	}
 
-	card->queue = blk_alloc_queue(GFP_KERNEL);
+	card->queue = blk_alloc_queue(GFP_KERNEL, BLK_QUEUE_NO_RESCUER);
 	if (!card->queue) {
 		dev_err(CARD_TO_DEV(card), "Failed queue alloc\n");
 		unregister_blkdev(card->major, DRIVER_NAME);
diff --git a/drivers/block/umem.c b/drivers/block/umem.c
index c141cc3be2..7d496364c4 100644
--- a/drivers/block/umem.c
+++ b/drivers/block/umem.c
@@ -890,7 +890,7 @@ static int mm_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
 	card->bio = NULL;
 	card->biotail = &card->bio;
 
-	card->queue = blk_alloc_queue(GFP_KERNEL);
+	card->queue = blk_alloc_queue(GFP_KERNEL, BLK_QUEUE_NO_RESCUER);
 	if (!card->queue)
 		goto failed_alloc;
 
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index e5ab7d9e8c..85ab96f15f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1245,7 +1245,7 @@ static int zram_add(void)
 
 	init_rwsem(&zram->init_lock);
 
-	queue = blk_alloc_queue(GFP_KERNEL);
+	queue = blk_alloc_queue(GFP_KERNEL, BLK_QUEUE_NO_RESCUER);
 	if (!queue) {
 		pr_err("Error allocating disk queue for device %d\n",
 			device_id);
diff --git a/drivers/lightnvm/gennvm.c b/drivers/lightnvm/gennvm.c
index ca7880082d..d36a155b42 100644
--- a/drivers/lightnvm/gennvm.c
+++ b/drivers/lightnvm/gennvm.c
@@ -233,7 +233,7 @@ static int gen_create_tgt(struct nvm_dev *dev, struct nvm_ioctl_create *create)
 		goto err_reserve;
 	}
 
-	tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node);
+	tqueue = blk_alloc_queue_node(GFP_KERNEL, dev->q->node, 0);
 	if (!tqueue)
 		goto err_dev;
 	blk_queue_make_request(tqueue, tt->make_rq);
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 3a19cbc8b2..9cdbeb54f6 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -800,7 +800,7 @@ static int bcache_device_init(struct bcache_device *d, unsigned block_size,
 	d->disk->fops		= &bcache_ops;
 	d->disk->private_data	= d;
 
-	q = blk_alloc_queue(GFP_KERNEL);
+	q = blk_alloc_queue(GFP_KERNEL, 0);
 	if (!q)
 		return -ENOMEM;
 
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3086da5664..e1b22a68d9 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1490,7 +1490,7 @@ static struct mapped_device *alloc_dev(int minor)
 	INIT_LIST_HEAD(&md->table_devices);
 	spin_lock_init(&md->uevent_lock);
 
-	md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id);
+	md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id, 0);
 	if (!md->queue)
 		goto bad;
 
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 01175dac0d..0038d241d7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5061,7 +5061,7 @@ static int md_alloc(dev_t dev, char *name)
 	}
 
 	error = -ENOMEM;
-	mddev->queue = blk_alloc_queue(GFP_KERNEL);
+	mddev->queue = blk_alloc_queue(GFP_KERNEL, 0);
 	if (!mddev->queue)
 		goto abort;
 	mddev->queue->queuedata = mddev;
diff --git a/drivers/nvdimm/blk.c b/drivers/nvdimm/blk.c
index 9faaa9694d..a1d2e7a6ab 100644
--- a/drivers/nvdimm/blk.c
+++ b/drivers/nvdimm/blk.c
@@ -264,7 +264,7 @@ static int nsblk_attach_disk(struct nd_namespace_blk *nsblk)
 	internal_nlba = div_u64(nsblk->size, nsblk_internal_lbasize(nsblk));
 	available_disk_size = internal_nlba * nsblk_sector_size(nsblk);
 
-	q = blk_alloc_queue(GFP_KERNEL);
+	q = blk_alloc_queue(GFP_KERNEL, BLK_QUEUE_NO_RESCUER);
 	if (!q)
 		return -ENOMEM;
 	if (devm_add_action_or_reset(dev, nd_blk_release_queue, q))
diff --git a/drivers/nvdimm/btt.c b/drivers/nvdimm/btt.c
index 368795aad5..7bd6135b77 100644
--- a/drivers/nvdimm/btt.c
+++ b/drivers/nvdimm/btt.c
@@ -1232,7 +1232,7 @@ static int btt_blk_init(struct btt *btt)
 	struct nd_namespace_common *ndns = nd_btt->ndns;
 
 	/* create a new disk and request queue for btt */
-	btt->btt_queue = blk_alloc_queue(GFP_KERNEL);
+	btt->btt_queue = blk_alloc_queue(GFP_KERNEL, BLK_QUEUE_NO_RESCUER);
 	if (!btt->btt_queue)
 		return -ENOMEM;
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 5b536be5a1..314ac480bf 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -280,7 +280,8 @@ static int pmem_attach_disk(struct device *dev,
 		return -EBUSY;
 	}
 
-	q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev));
+	q = blk_alloc_queue_node(GFP_KERNEL, dev_to_node(dev),
+				 BLK_QUEUE_NO_RESCUER);
 	if (!q)
 		return -ENOMEM;
 
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 9d66b4fb17..101e0ae2f7 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -612,7 +612,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
 	}
 	dev_info->gd->major = dcssblk_major;
 	dev_info->gd->fops = &dcssblk_devops;
-	dev_info->dcssblk_queue = blk_alloc_queue(GFP_KERNEL);
+	dev_info->dcssblk_queue = blk_alloc_queue(GFP_KERNEL, BLK_QUEUE_NO_RESCUER);
 	dev_info->gd->queue = dev_info->dcssblk_queue;
 	dev_info->gd->private_data = dev_info;
 	blk_queue_make_request(dev_info->dcssblk_queue, dcssblk_make_request);
diff --git a/drivers/s390/block/xpram.c b/drivers/s390/block/xpram.c
index b9d7e755c8..72f52de17b 100644
--- a/drivers/s390/block/xpram.c
+++ b/drivers/s390/block/xpram.c
@@ -342,7 +342,7 @@ static int __init xpram_setup_blkdev(void)
 		xpram_disks[i] = alloc_disk(1);
 		if (!xpram_disks[i])
 			goto out;
-		xpram_queues[i] = blk_alloc_queue(GFP_KERNEL);
+		xpram_queues[i] = blk_alloc_queue(GFP_KERNEL, BLK_QUEUE_NO_RESCUER);
 		if (!xpram_queues[i]) {
 			put_disk(xpram_disks[i]);
 			goto out;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7cf8a6c70a..ac333e9528 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -656,6 +656,13 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
 	return bio;
 }
 
+struct bio_plug_list {
+	struct bio_list		bios;
+	struct request_queue	*q;
+};
+
+void blk_punt_blocked_bios(struct bio_plug_list *);
+
 /*
  * Increment chain count for the bio. Make sure the CHAIN flag update
  * is visible before the raised count.
@@ -685,15 +692,6 @@ struct bio_set {
 	mempool_t *bio_integrity_pool;
 	mempool_t *bvec_integrity_pool;
 #endif
-
-	/*
-	 * Deadlock avoidance for stacking block drivers: see comments in
-	 * bio_alloc_bioset() for details
-	 */
-	spinlock_t		rescue_lock;
-	struct bio_list		rescue_list;
-	struct work_struct	rescue_work;
-	struct workqueue_struct	*rescue_workqueue;
 };
 
 struct biovec_slab {
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1ca8e8fd10..01acaf9bf9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -570,6 +570,16 @@ struct request_queue {
 	struct bio_set		*bio_split;
 
 	bool			mq_sysfs_init_done;
+
+	/*
+	 * Deadlock avoidance, to deal with the plugging in
+	 * generic_make_request() that converts recursion to iteration to avoid
+	 * stack overflow:
+	 */
+	spinlock_t		rescue_lock;
+	struct bio_list		rescue_list;
+	struct work_struct	rescue_work;
+	struct workqueue_struct	*rescue_workqueue;
 };
 
 #define QUEUE_FLAG_QUEUED	1	/* uses generic tag queueing */
@@ -1192,9 +1202,11 @@ extern int blk_rq_map_sg(struct request_queue *, struct request *, struct scatte
 extern void blk_dump_rq_flags(struct request *, char *);
 extern long nr_blockdev_pages(void);
 
+#define BLK_QUEUE_NO_RESCUER		1
+
 bool __must_check blk_get_queue(struct request_queue *);
-struct request_queue *blk_alloc_queue(gfp_t);
-struct request_queue *blk_alloc_queue_node(gfp_t, int);
+struct request_queue *blk_alloc_queue(gfp_t, int);
+struct request_queue *blk_alloc_queue_node(gfp_t, int, int);
 extern void blk_put_queue(struct request_queue *);
 extern void blk_set_queue_dying(struct request_queue *);
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad3ec9ec61..574ddc4f13 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1797,7 +1797,7 @@ struct task_struct {
 	void *journal_info;
 
 /* stacked block device info */
-	struct bio_list *bio_list;
+	struct bio_plug_list *bio_list;
 
 #ifdef CONFIG_BLOCK
 /* stack plugging */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c56fb57f29..07309d9610 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3440,6 +3440,12 @@ static inline void sched_submit_work(struct task_struct *tsk)
 {
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
+
+	if (tsk->bio_list &&
+	    !bio_list_empty(&tsk->bio_list->bios) &&
+	    tsk->bio_list->q->rescue_workqueue)
+		blk_punt_blocked_bios(tsk->bio_list);
+
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH] block: Make rescuer threads per request_queue, not per bioset
  2017-02-08  4:58                   ` Kent Overstreet
@ 2017-02-08  6:22                     ` kbuild test robot
  2017-02-08  6:23                     ` kbuild test robot
                                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 33+ messages in thread
From: kbuild test robot @ 2017-02-08  6:22 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: kbuild-all, Pavel Machek, Mike Snitzer, kernel list, axboe, hch,
	neilb, martin.petersen, dpark, ming.l, dm-devel, ming.lei, agk,
	jkosina, geoff, jim, pjk1939, minchan, ngupta, oleg.drokin,
	andreas.dilger

[-- Attachment #1: Type: text/plain, Size: 1936 bytes --]

Hi Kent,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.10-rc7]
[cannot apply to next-20170207]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kent-Overstreet/block-Make-rescuer-threads-per-request_queue-not-per-bioset/20170208-130414
config: x86_64-randconfig-x014-201706 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

   kernel/sched/core.c: In function 'sched_submit_work':
>> kernel/sched/core.c:3445:7: error: implicit declaration of function 'bio_list_empty' [-Werror=implicit-function-declaration]
         !bio_list_empty(&tsk->bio_list->bios) &&
          ^~~~~~~~~~~~~~
>> kernel/sched/core.c:3445:36: error: dereferencing pointer to incomplete type 'struct bio_plug_list'
         !bio_list_empty(&tsk->bio_list->bios) &&
                                       ^~
>> kernel/sched/core.c:3447:3: error: implicit declaration of function 'blk_punt_blocked_bios' [-Werror=implicit-function-declaration]
      blk_punt_blocked_bios(tsk->bio_list);
      ^~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/bio_list_empty +3445 kernel/sched/core.c

  3439	static inline void sched_submit_work(struct task_struct *tsk)
  3440	{
  3441		if (!tsk->state || tsk_is_pi_blocked(tsk))
  3442			return;
  3443	
  3444		if (tsk->bio_list &&
> 3445		    !bio_list_empty(&tsk->bio_list->bios) &&
  3446		    tsk->bio_list->q->rescue_workqueue)
> 3447			blk_punt_blocked_bios(tsk->bio_list);
  3448	
  3449		/*
  3450		 * If we are going to sleep and we have plugged IO queued,

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 21681 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH] block: Make rescuer threads per request_queue, not per bioset
  2017-02-08  4:58                   ` Kent Overstreet
  2017-02-08  6:22                     ` [PATCH] block: Make rescuer threads per request_queue, not per bioset kbuild test robot
@ 2017-02-08  6:23                     ` kbuild test robot
  2017-02-08  6:57                     ` v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone Mike Galbraith
  2017-02-08 16:34                     ` Mike Snitzer
  3 siblings, 0 replies; 33+ messages in thread
From: kbuild test robot @ 2017-02-08  6:23 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: kbuild-all, Pavel Machek, Mike Snitzer, kernel list, axboe, hch,
	neilb, martin.petersen, dpark, ming.l, dm-devel, ming.lei, agk,
	jkosina, geoff, jim, pjk1939, minchan, ngupta, oleg.drokin,
	andreas.dilger

[-- Attachment #1: Type: text/plain, Size: 3433 bytes --]

Hi Kent,

[auto build test WARNING on linus/master]
[also build test WARNING on v4.10-rc7]
[cannot apply to next-20170207]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Kent-Overstreet/block-Make-rescuer-threads-per-request_queue-not-per-bioset/20170208-130414
config: x86_64-randconfig-x017-201706 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/uapi/linux/stddef.h:1:0,
                    from include/linux/stddef.h:4,
                    from include/uapi/linux/posix_types.h:4,
                    from include/uapi/linux/types.h:13,
                    from include/linux/types.h:5,
                    from include/uapi/linux/capability.h:16,
                    from include/linux/capability.h:15,
                    from include/linux/sched.h:15,
                    from include/linux/kasan.h:4,
                    from kernel/sched/core.c:29:
   kernel/sched/core.c: In function 'sched_submit_work':
   kernel/sched/core.c:3445:7: error: implicit declaration of function 'bio_list_empty' [-Werror=implicit-function-declaration]
         !bio_list_empty(&tsk->bio_list->bios) &&
          ^
   include/linux/compiler.h:149:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> kernel/sched/core.c:3444:2: note: in expansion of macro 'if'
     if (tsk->bio_list &&
     ^~
   kernel/sched/core.c:3445:36: error: dereferencing pointer to incomplete type 'struct bio_plug_list'
         !bio_list_empty(&tsk->bio_list->bios) &&
                                       ^
   include/linux/compiler.h:149:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> kernel/sched/core.c:3444:2: note: in expansion of macro 'if'
     if (tsk->bio_list &&
     ^~
   kernel/sched/core.c:3447:3: error: implicit declaration of function 'blk_punt_blocked_bios' [-Werror=implicit-function-declaration]
      blk_punt_blocked_bios(tsk->bio_list);
      ^~~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/if +3444 kernel/sched/core.c

  3428	
  3429		/* causes final put_task_struct in finish_task_switch(). */
  3430		__set_current_state(TASK_DEAD);
  3431		current->flags |= PF_NOFREEZE;	/* tell freezer to ignore us */
  3432		__schedule(false);
  3433		BUG();
  3434		/* Avoid "noreturn function does return".  */
  3435		for (;;)
  3436			cpu_relax();	/* For when BUG is null */
  3437	}
  3438	
  3439	static inline void sched_submit_work(struct task_struct *tsk)
  3440	{
  3441		if (!tsk->state || tsk_is_pi_blocked(tsk))
  3442			return;
  3443	
> 3444		if (tsk->bio_list &&
  3445		    !bio_list_empty(&tsk->bio_list->bios) &&
  3446		    tsk->bio_list->q->rescue_workqueue)
  3447			blk_punt_blocked_bios(tsk->bio_list);
  3448	
  3449		/*
  3450		 * If we are going to sleep and we have plugged IO queued,
  3451		 * make sure to submit it to avoid deadlocks.
  3452		 */

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 23909 bytes --]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-08  4:58                   ` Kent Overstreet
  2017-02-08  6:22                     ` [PATCH] block: Make rescuer threads per request_queue, not per bioset kbuild test robot
  2017-02-08  6:23                     ` kbuild test robot
@ 2017-02-08  6:57                     ` Mike Galbraith
  2017-02-08 16:34                     ` Mike Snitzer
  3 siblings, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2017-02-08  6:57 UTC (permalink / raw)
  To: Kent Overstreet, Pavel Machek
  Cc: Mike Snitzer, kernel list, axboe, hch, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger

On Tue, 2017-02-07 at 19:58 -0900, Kent Overstreet wrote:
> On Tue, Feb 07, 2017 at 09:39:11PM +0100, Pavel Machek wrote:
> > On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > > 
> > > > > So.. what needs to be done there?
> > > 
> > > > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > > > I'll try and come up with a patch...
> > > 
> > > Ok, here's such a patch, only lightly tested:
> > 
> > I guess it would be nice for me to test it... but what it is against?
> > I tried after v4.10-rc5 and linux-next, but got rejects in both cases.
> 
> Sorry, I forgot I had a few other patches in my branch that touch
> mempool/biosets code.
> 
> Also, after thinking about it more and looking at the relevant code, I'm pretty
> sure we don't need rescuer threads for block devices that just split bios - i.e.
> most of them, so I changed my patch to do that.
> 
> Tested it by ripping out the current->bio_list checks/workarounds from the
> bcache code, appears to work:

Patch killed every last one of them, but..

homer:/root # dmesg|grep WARNING
[   11.701447] WARNING: CPU: 4 PID: 801 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240
[   11.711027] WARNING: CPU: 4 PID: 801 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0
[   19.728989] WARNING: CPU: 0 PID: 717 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240
[   19.737020] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0
[   19.746173] WARNING: CPU: 0 PID: 717 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240
[   19.755260] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0
[   19.763837] WARNING: CPU: 0 PID: 717 at block/bio.c:388 bio_alloc_bioset+0x1a7/0x240
[   19.772526] WARNING: CPU: 0 PID: 717 at block/blk-core.c:2013 generic_make_request+0x191/0x1f0

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-08  4:58                   ` Kent Overstreet
                                       ` (2 preceding siblings ...)
  2017-02-08  6:57                     ` v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone Mike Galbraith
@ 2017-02-08 16:34                     ` Mike Snitzer
  2017-02-09 21:25                       ` Kent Overstreet
  3 siblings, 1 reply; 33+ messages in thread
From: Mike Snitzer @ 2017-02-08 16:34 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Pavel Machek, kernel list, axboe, hch, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger,
	linux-block

On Tue, Feb 07 2017 at 11:58pm -0500,
Kent Overstreet <kent.overstreet@gmail.com> wrote:

> On Tue, Feb 07, 2017 at 09:39:11PM +0100, Pavel Machek wrote:
> > On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > > 
> > > > > So.. what needs to be done there?
> > > 
> > > > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > > > I'll try and come up with a patch...
> > > 
> > > Ok, here's such a patch, only lightly tested:
> > 
> > I guess it would be nice for me to test it... but what it is against?
> > I tried after v4.10-rc5 and linux-next, but got rejects in both cases.
> 
> Sorry, I forgot I had a few other patches in my branch that touch
> mempool/biosets code.
> 
> Also, after thinking about it more and looking at the relevant code, I'm pretty
> sure we don't need rescuer threads for block devices that just split bios - i.e.
> most of them, so I changed my patch to do that.
> 
> Tested it by ripping out the current->bio_list checks/workarounds from the
> bcache code, appears to work:

Feedback on this patch below, but first:

There are deeper issues with the current->bio_list and rescue workqueues
than thread counts.

I cannot help but feel like you (and Jens) are repeatedly ignoring the
issue that has been raised numerous times, most recently:
https://www.redhat.com/archives/dm-devel/2017-February/msg00059.html

FYI, this test (albeit ugly) can be used to check if the dm-snapshot
deadlock is fixed:
https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html

This situation is the unfortunate pathological worst case for what
happens when changes are merged and nobody wants to own fixing the
unforeseen implications/regressions.  Like everyone else in a position
of Linux maintenance I've tried to stay away from owning the
responsibility of a fix -- it isn't working.  Ok, I'll stop bitching
now.. I do bear responsibility for not digging in myself.  We're all
busy and this issue is "hard".

> -- >8 --
> Subject: [PATCH] block: Make rescuer threads per request_queue, not per bioset
> 
> Also, trigger rescuing whenever we block with bios on current->bio_list,
> instead of only when we block in bio_alloc_bioset(). This is more correct,
> and should result in fewer rescuer threads.
> 
> XXX: The current->bio_list plugging needs to be unified with the
> blk_plug mechanism.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
...
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 3086da5664..e1b22a68d9 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1490,7 +1490,7 @@ static struct mapped_device *alloc_dev(int minor)
>  	INIT_LIST_HEAD(&md->table_devices);
>  	spin_lock_init(&md->uevent_lock);
>  
> -	md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id);
> +	md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id, 0);
>  	if (!md->queue)
>  		goto bad;
>  

This should be BLK_QUEUE_NO_RESCUER as DM isn't making direct use of
bio_queue_split() for its own internal splitting (maybe it should, and
that'd start to fix the issue I've been harping about?) but as-is DM
destroys the rescuer workqueue (since commit dbba42d8a9eb "dm: eliminate
unused "bioset" process for each bio-based DM device").

Mike

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-08 16:34                     ` Mike Snitzer
@ 2017-02-09 21:25                       ` Kent Overstreet
  2017-02-14 16:34                         ` [dm-devel] " Mikulas Patocka
  2017-02-14 17:33                         ` Mike Snitzer
  0 siblings, 2 replies; 33+ messages in thread
From: Kent Overstreet @ 2017-02-09 21:25 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Pavel Machek, kernel list, axboe, hch, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger,
	linux-block

On Wed, Feb 08, 2017 at 11:34:07AM -0500, Mike Snitzer wrote:
> On Tue, Feb 07 2017 at 11:58pm -0500,
> Kent Overstreet <kent.overstreet@gmail.com> wrote:
> 
> > On Tue, Feb 07, 2017 at 09:39:11PM +0100, Pavel Machek wrote:
> > > On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > > > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > > > 
> > > > > > So.. what needs to be done there?
> > > > 
> > > > > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > > > > I'll try and come up with a patch...
> > > > 
> > > > Ok, here's such a patch, only lightly tested:
> > > 
> > > I guess it would be nice for me to test it... but what it is against?
> > > I tried after v4.10-rc5 and linux-next, but got rejects in both cases.
> > 
> > Sorry, I forgot I had a few other patches in my branch that touch
> > mempool/biosets code.
> > 
> > Also, after thinking about it more and looking at the relevant code, I'm pretty
> > sure we don't need rescuer threads for block devices that just split bios - i.e.
> > most of them, so I changed my patch to do that.
> > 
> > Tested it by ripping out the current->bio_list checks/workarounds from the
> > bcache code, appears to work:
> 
> Feedback on this patch below, but first:
> 
> There are deeper issues with the current->bio_list and rescue workqueues
> than thread counts.
> 
> I cannot help but feel like you (and Jens) are repeatedly ignoring the
> issue that has been raised numerous times, most recently:
> https://www.redhat.com/archives/dm-devel/2017-February/msg00059.html
> 
> FYI, this test (albeit ugly) can be used to check if the dm-snapshot
> deadlock is fixed:
> https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> 
> This situation is the unfortunate pathological worst case for what
> happens when changes are merged and nobody wants to own fixing the
> unforeseen implications/regressions.  Like everyone else in a position
> of Linux maintenance I've tried to stay away from owning the
> responsibility of a fix -- it isn't working.  Ok, I'll stop bitching
> now.. I do bear responsibility for not digging in myself.  We're all
> busy and this issue is "hard".

Mike, it's not my job to debug DM code for you or sift through your bug reports.
I don't read dm-devel, and I don't know why you think that's my job.

If there's something you think the block layer should be doing differently, post
patches - or at the very least, explain what you'd like to be done, with words.
Don't get pissy because I'm not sifting through your bug reports.

Hell, I'm not getting paid to work on kernel code at all right now, and you
trying to rope me into fixing device mapper sure makes me want to work on the
block layer more.

DM developers have a long history of working siloed off from the rest of the
block layer, building up their own crazy infrastructure (remember the old bio
splitting code?) and going to extreme lengths to avoid having to work on or
improve the core block layer infrastructure. It's ridiculous.

You know what would be nice? What'd really make my day is if just once I got a
thank you or a bit of appreciation from DM developers for the bvec iterators/bio
splitting work I did that cleaned up a _lot_ of crazy hairy messes. Or getting
rid of merge_bvec_fn, or trying to come up with a better solution for deadlocks
due to running under generic_make_request() now.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [dm-devel] v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-09 21:25                       ` Kent Overstreet
@ 2017-02-14 16:34                         ` Mikulas Patocka
  2017-02-14 17:33                         ` Mike Snitzer
  1 sibling, 0 replies; 33+ messages in thread
From: Mikulas Patocka @ 2017-02-14 16:34 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Mike Snitzer, oleg.drokin, ming.l, andreas.dilger,
	martin.petersen, minchan, jkosina, ming.lei, kernel list, jim,
	pjk1939, axboe, geoff, dm-devel, linux-block, dpark,
	Pavel Machek, ngupta, hch, agk



On Thu, 9 Feb 2017, Kent Overstreet wrote:

> On Wed, Feb 08, 2017 at 11:34:07AM -0500, Mike Snitzer wrote:
> > On Tue, Feb 07 2017 at 11:58pm -0500,
> > Kent Overstreet <kent.overstreet@gmail.com> wrote:
> > 
> > > On Tue, Feb 07, 2017 at 09:39:11PM +0100, Pavel Machek wrote:
> > > > On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > > > > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > > > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > > > > 
> > > > > > > So.. what needs to be done there?
> > > > > 
> > > > > > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > > > > > I'll try and come up with a patch...
> > > > > 
> > > > > Ok, here's such a patch, only lightly tested:
> > > > 
> > > > I guess it would be nice for me to test it... but what it is against?
> > > > I tried after v4.10-rc5 and linux-next, but got rejects in both cases.
> > > 
> > > Sorry, I forgot I had a few other patches in my branch that touch
> > > mempool/biosets code.
> > > 
> > > Also, after thinking about it more and looking at the relevant code, I'm pretty
> > > sure we don't need rescuer threads for block devices that just split bios - i.e.
> > > most of them, so I changed my patch to do that.
> > > 
> > > Tested it by ripping out the current->bio_list checks/workarounds from the
> > > bcache code, appears to work:
> > 
> > Feedback on this patch below, but first:
> > 
> > There are deeper issues with the current->bio_list and rescue workqueues
> > than thread counts.
> > 
> > I cannot help but feel like you (and Jens) are repeatedly ignoring the
> > issue that has been raised numerous times, most recently:
> > https://www.redhat.com/archives/dm-devel/2017-February/msg00059.html
> > 
> > FYI, this test (albeit ugly) can be used to check if the dm-snapshot
> > deadlock is fixed:
> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> > 
> > This situation is the unfortunate pathological worst case for what
> > happens when changes are merged and nobody wants to own fixing the
> > unforeseen implications/regressions.  Like everyone else in a position
> > of Linux maintenance I've tried to stay away from owning the
> > responsibility of a fix -- it isn't working.  Ok, I'll stop bitching
> > now.. I do bear responsibility for not digging in myself.  We're all
> > busy and this issue is "hard".
> 
> Mike, it's not my job to debug DM code for you or sift through your bug reports.
> I don't read dm-devel, and I don't know why you think that's my job.
> 
> If there's something you think the block layer should be doing differently, post
> patches - or at the very least, explain what you'd like to be done, with words.
> Don't get pissy because I'm not sifting through your bug reports.

So I am posting this patch for that bug.

Will any of the block device maintainers respond to it?



From: Mikulas Patocka <mpatocka@redhat.com>
Date: Tue, 27 May 2014 11:03:36 -0400
Subject: block: flush queued bios when process blocks to avoid deadlock

The block layer uses per-process bio list to avoid recursion in
generic_make_request.  When generic_make_request is called recursively,
the bio is added to current->bio_list and generic_make_request returns
immediately.  The top-level instance of generic_make_request takes bios
from current->bio_list and processes them.
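
In simplified pseudo-C, the pattern described above looks roughly like
this (sketch only: generic_make_request_checks(), blk_queue_enter() and
error handling are omitted):

blk_qc_t generic_make_request(struct bio *bio)
{
	struct bio_list bio_list_on_stack;

	if (current->bio_list) {
		/* recursive call: queue the bio, the top level will issue it */
		bio_list_add(current->bio_list, bio);
		return BLK_QC_T_NONE;
	}

	/* top-level call: turn recursion into iteration */
	bio_list_init(&bio_list_on_stack);
	current->bio_list = &bio_list_on_stack;
	do {
		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

		q->make_request_fn(q, bio);
		bio = bio_list_pop(current->bio_list);
	} while (bio);
	current->bio_list = NULL;

	return BLK_QC_T_NONE;
}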

The problem is that this bio queuing on current->bio_list creates an
artificial locking dependency - a bio further down current->bio_list depends
on any locks that the preceding bios could take.  This could result in a
deadlock.

Commit df2cb6daa4 ("block: Avoid deadlocks with bio allocation by
stacking drivers") created a workqueue for every bio set and code
in bio_alloc_bioset() that tries to resolve some low-memory deadlocks by
redirecting bios queued on current->bio_list to the workqueue if the
system is low on memory.  However another deadlock (see below **) may
happen, without any low memory condition, because generic_make_request
is queuing bios to current->bio_list (rather than submitting them).

Fix this deadlock by redirecting any bios on current->bio_list to the
bio_set's rescue workqueue on every schedule call.  Consequently, when
the process blocks on a mutex, the bios queued on current->bio_list are
dispatched to independent workqueues and they can complete without
waiting for the mutex to be available.

Also, now we can remove punt_bios_to_rescuer() and bio_alloc_bioset()'s
calls to it because bio_alloc_bioset() will implicitly punt all bios on
current->bio_list if it performs a blocking allocation.

** Here is the dm-snapshot deadlock that was observed:

1) Process A sends one-page read bio to the dm-snapshot target. The bio
spans snapshot chunk boundary and so it is split to two bios by device
mapper.

2) Device mapper creates the first sub-bio and sends it to the snapshot
driver.

3) The function snapshot_map calls track_chunk (that allocates a structure
dm_snap_tracked_chunk and adds it to tracked_chunk_hash) and then remaps
the bio to the underlying device and exits with DM_MAPIO_REMAPPED.

4) The remapped bio is submitted with generic_make_request, but it isn't
issued - it is added to current->bio_list instead.

5) Meanwhile, process B (dm's kcopyd) executes pending_complete for the
chunk affected by the first remapped bio; it takes down_write(&s->lock)
and then loops in __check_for_conflicting_io, waiting for
dm_snap_tracked_chunk created in step 3) to be released.

6) Process A continues, it creates a second sub-bio for the rest of the
original bio.

7) snapshot_map is called for this new bio, it waits on
down_write(&s->lock) that is held by Process B (in step 5).

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1267650
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Depends-on: df2cb6daa4 ("block: Avoid deadlocks with bio allocation by stacking drivers")
Cc: stable@vger.kernel.org

---
 block/bio.c            |   77 +++++++++++++++++++------------------------------
 include/linux/blkdev.h |   24 ++++++++++-----
 kernel/sched/core.c    |    7 +---
 3 files changed, 50 insertions(+), 58 deletions(-)

Index: linux-4.10-rc2/block/bio.c
===================================================================
--- linux-4.10-rc2.orig/block/bio.c
+++ linux-4.10-rc2/block/bio.c
@@ -357,35 +357,37 @@ static void bio_alloc_rescue(struct work
 	}
 }
 
-static void punt_bios_to_rescuer(struct bio_set *bs)
+/**
+ * blk_flush_bio_list
+ * @tsk: task_struct whose bio_list must be flushed
+ *
+ * Pop bios queued on @tsk->bio_list and submit each of them to
+ * their rescue workqueue.
+ *
+ * If the bio doesn't have a bio_set, we leave it on @tsk->bio_list.
+ * If the bio is allocated from fs_bio_set, we must leave it to avoid
+ * deadlock on loopback block device.
+ * Stacking bio drivers should use bio_set, so this shouldn't be
+ * an issue.
+ */
+void blk_flush_bio_list(struct task_struct *tsk)
 {
-	struct bio_list punt, nopunt;
 	struct bio *bio;
+	struct bio_list list = *tsk->bio_list;
+	bio_list_init(tsk->bio_list);
 
-	/*
-	 * In order to guarantee forward progress we must punt only bios that
-	 * were allocated from this bio_set; otherwise, if there was a bio on
-	 * there for a stacking driver higher up in the stack, processing it
-	 * could require allocating bios from this bio_set, and doing that from
-	 * our own rescuer would be bad.
-	 *
-	 * Since bio lists are singly linked, pop them all instead of trying to
-	 * remove from the middle of the list:
-	 */
-
-	bio_list_init(&punt);
-	bio_list_init(&nopunt);
-
-	while ((bio = bio_list_pop(current->bio_list)))
-		bio_list_add(bio->bi_pool == bs ? &punt : &nopunt, bio);
-
-	*current->bio_list = nopunt;
-
-	spin_lock(&bs->rescue_lock);
-	bio_list_merge(&bs->rescue_list, &punt);
-	spin_unlock(&bs->rescue_lock);
+	while ((bio = bio_list_pop(&list))) {
+		struct bio_set *bs = bio->bi_pool;
+		if (unlikely(!bs) || bs == fs_bio_set) {
+			bio_list_add(tsk->bio_list, bio);
+			continue;
+		}
 
-	queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_lock(&bs->rescue_lock);
+		bio_list_add(&bs->rescue_list, bio);
+		queue_work(bs->rescue_workqueue, &bs->rescue_work);
+		spin_unlock(&bs->rescue_lock);
+	}
 }
 
 /**
@@ -425,7 +427,6 @@ static void punt_bios_to_rescuer(struct
  */
 struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 {
-	gfp_t saved_gfp = gfp_mask;
 	unsigned front_pad;
 	unsigned inline_vecs;
 	struct bio_vec *bvl = NULL;
@@ -459,23 +460,11 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 		 * reserve.
 		 *
 		 * We solve this, and guarantee forward progress, with a rescuer
-		 * workqueue per bio_set. If we go to allocate and there are
-		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
-		 * bios we would be blocking to the rescuer workqueue before
-		 * we retry with the original gfp_flags.
+		 * workqueue per bio_set. If an allocation would block (due to
+		 * __GFP_DIRECT_RECLAIM) the scheduler will first punt all bios
+		 * on current->bio_list to the rescuer workqueue.
 		 */
-
-		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
-
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
-		if (!p && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			p = mempool_alloc(bs->bio_pool, gfp_mask);
-		}
-
 		front_pad = bs->front_pad;
 		inline_vecs = BIO_INLINE_VECS;
 	}
@@ -490,12 +479,6 @@ struct bio *bio_alloc_bioset(gfp_t gfp_m
 		unsigned long idx = 0;
 
 		bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		if (!bvl && gfp_mask != saved_gfp) {
-			punt_bios_to_rescuer(bs);
-			gfp_mask = saved_gfp;
-			bvl = bvec_alloc(gfp_mask, nr_iovecs, &idx, bs->bvec_pool);
-		}
-
 		if (unlikely(!bvl))
 			goto err_free;
 
Index: linux-4.10-rc2/include/linux/blkdev.h
===================================================================
--- linux-4.10-rc2.orig/include/linux/blkdev.h
+++ linux-4.10-rc2/include/linux/blkdev.h
@@ -1267,6 +1267,22 @@ static inline bool blk_needs_flush_plug(
 		 !list_empty(&plug->cb_list));
 }
 
+extern void blk_flush_bio_list(struct task_struct *tsk);
+
+static inline void blk_flush_queued_io(struct task_struct *tsk)
+{
+	/*
+	 * Flush any queued bios to corresponding rescue threads.
+	 */
+	if (tsk->bio_list && !bio_list_empty(tsk->bio_list))
+		blk_flush_bio_list(tsk);
+	/*
+	 * Flush any plugged IO that is queued.
+	 */
+	if (blk_needs_flush_plug(tsk))
+		blk_schedule_flush_plug(tsk);
+}
+
 /*
  * tag stuff
  */
@@ -1921,16 +1937,10 @@ static inline void blk_flush_plug(struct
 {
 }
 
-static inline void blk_schedule_flush_plug(struct task_struct *task)
+static inline void blk_flush_queued_io(struct task_struct *tsk)
 {
 }
 
-
-static inline bool blk_needs_flush_plug(struct task_struct *tsk)
-{
-	return false;
-}
-
 static inline int blkdev_issue_flush(struct block_device *bdev, gfp_t gfp_mask,
 				     sector_t *error_sector)
 {
Index: linux-4.10-rc2/kernel/sched/core.c
===================================================================
--- linux-4.10-rc2.orig/kernel/sched/core.c
+++ linux-4.10-rc2/kernel/sched/core.c
@@ -3441,11 +3441,10 @@ static inline void sched_submit_work(str
 	if (!tsk->state || tsk_is_pi_blocked(tsk))
 		return;
 	/*
-	 * If we are going to sleep and we have plugged IO queued,
+	 * If we are going to sleep and we have queued IO,
 	 * make sure to submit it to avoid deadlocks.
 	 */
-	if (blk_needs_flush_plug(tsk))
-		blk_schedule_flush_plug(tsk);
+	blk_flush_queued_io(tsk);
 }
 
 asmlinkage __visible void __sched schedule(void)
@@ -5068,7 +5067,7 @@ long __sched io_schedule_timeout(long ti
 	long ret;
 
 	current->in_iowait = 1;
-	blk_schedule_flush_plug(current);
+	blk_flush_queued_io(current);
 
 	delayacct_blkio_start();
 	rq = raw_rq();

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone
  2017-02-09 21:25                       ` Kent Overstreet
  2017-02-14 16:34                         ` [dm-devel] " Mikulas Patocka
@ 2017-02-14 17:33                         ` Mike Snitzer
  1 sibling, 0 replies; 33+ messages in thread
From: Mike Snitzer @ 2017-02-14 17:33 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Pavel Machek, kernel list, axboe, hch, neilb, martin.petersen,
	dpark, ming.l, dm-devel, ming.lei, agk, jkosina, geoff, jim,
	pjk1939, minchan, ngupta, oleg.drokin, andreas.dilger,
	linux-block

On Thu, Feb 09 2017 at  4:25pm -0500,
Kent Overstreet <kent.overstreet@gmail.com> wrote:

> On Wed, Feb 08, 2017 at 11:34:07AM -0500, Mike Snitzer wrote:
> > On Tue, Feb 07 2017 at 11:58pm -0500,
> > Kent Overstreet <kent.overstreet@gmail.com> wrote:
> > 
> > > On Tue, Feb 07, 2017 at 09:39:11PM +0100, Pavel Machek wrote:
> > > > On Mon 2017-02-06 17:49:06, Kent Overstreet wrote:
> > > > > On Mon, Feb 06, 2017 at 04:47:24PM -0900, Kent Overstreet wrote:
> > > > > > On Mon, Feb 06, 2017 at 01:53:09PM +0100, Pavel Machek wrote:
> > > > > > > Still there on v4.9, 36 threads on nokia n900 cellphone.
> > > > > > > 
> > > > > > > So.. what needs to be done there?
> > > > > 
> > > > > > But, I just got an idea for how to handle this that might be halfway sane, maybe
> > > > > > I'll try and come up with a patch...
> > > > > 
> > > > > Ok, here's such a patch, only lightly tested:
> > > > 
> > > > I guess it would be nice for me to test it... but what it is against?
> > > > I tried after v4.10-rc5 and linux-next, but got rejects in both cases.
> > > 
> > > Sorry, I forgot I had a few other patches in my branch that touch
> > > mempool/biosets code.
> > > 
> > > Also, after thinking about it more and looking at the relevant code, I'm pretty
> > > sure we don't need rescuer threads for block devices that just split bios - i.e.
> > > most of them, so I changed my patch to do that.
> > > 
> > > Tested it by ripping out the current->bio_list checks/workarounds from the
> > > bcache code, appears to work:
> > 
> > Feedback on this patch below, but first:
> > 
> > There are deeper issues with the current->bio_list and rescue workqueues
> > than thread counts.
> > 
> > I cannot help but feel like you (and Jens) are repeatedly ignoring the
> > issue that has been raised numerous times, most recently:
> > https://www.redhat.com/archives/dm-devel/2017-February/msg00059.html
> > 
> > FYI, this test (albeit ugly) can be used to check if the dm-snapshot
> > deadlock is fixed:
> > https://www.redhat.com/archives/dm-devel/2017-January/msg00064.html
> > 
> > This situation is the unfortunate pathological worst case for what
> > happens when changes are merged and nobody wants to own fixing the
> > unforeseen implications/regressions.  Like everyone else in a position
> > of Linux maintenance I've tried to stay away from owning the
> > responsibility of a fix -- it isn't working.  Ok, I'll stop bitching
> > now.. I do bear responsibility for not digging in myself.  We're all
> > busy and this issue is "hard".
> 
> Mike, it's not my job to debug DM code for you or sift through your bug reports.
> I don't read dm-devel, and I don't know why you think that's my job.
> 
> If there's something you think the block layer should be doing differently, post
> patches - or at the very least, explain what you'd like to be done, with words.
> Don't get pissy because I'm not sifting through your bug reports.
> 
> Hell, I'm not getting paid to work on kernel code at all right now, and you
> trying to rope me into fixing device mapper sure makes me want to work on the
> block layer more.
> 
> DM developers have a long history of working siloed off from the rest of the
> block layer, building up their own crazy infrastructure (remember the old bio
> splitting code?) and going to extreme lengths to avoid having to work on or
> improve the core block layer infrastructure. It's ridiculous.

That is bullshit.  I try to help block core where/when I can (did so
with discards and limits stacking, other fixes here and there).

Your take on what DM code is and how it evolved is way off base.  But
that is to be expected from you.  Not going to waste time laboring over
correcting you.

> You know what would be nice? What'd really make my day is if just once I got a
> thank you or a bit of appreciation from DM developers for the bvec iterators/bio
> splitting work I did that cleaned up a _lot_ of crazy hairy messes. Or getting
> rid of merge_bvec_fn, or trying to come up with a better solution for deadlocks
> due to running under generic_make_request() now.

I do appreciate the immutable biovec work you did.  Helped speed up bio
cloning quite nicely.

But you've always been hostile to DM.  You'll fabricate any excuse to
never touch or care about it.  Repeatedly noted.  But this is a block
core bug/regression.  Not DM.

If you think I'm going to thank you for blindly breaking shit, and
refusing to care when you're made aware of it, then you're out of your
mind.

Save your predictably hostile response.  It'll get marked as spam
anyway.

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2017-02-14 17:33 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-11 10:49 4.4-rc: 28 bioset threads on small notebook Pavel Machek
2015-12-11 14:08 ` Mike Snitzer
2015-12-11 17:14   ` Pavel Machek
2016-02-20 17:40   ` 4.4-final: " Pavel Machek
2016-02-20 18:42     ` Pavel Machek
2016-02-20 19:51       ` Mike Snitzer
2016-02-20 20:04         ` Pavel Machek
2016-02-20 20:38           ` Mike Snitzer
2016-02-20 20:55             ` Pavel Machek
2016-02-21  4:15               ` Kent Overstreet
2016-02-21  6:43                 ` Ming Lin-SSI
2016-02-21  9:40                   ` Ming Lei
2016-02-22 22:58                     ` Kent Overstreet
2016-02-23  2:55                       ` Ming Lei
2016-02-23 14:54                         ` Mike Snitzer
2016-02-24  2:48                           ` Ming Lei
2016-02-24  3:23                             ` Kent Overstreet
2016-02-23 20:45                       ` Pavel Machek
2017-02-06 12:53           ` v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone Pavel Machek
2017-02-07  1:47             ` Kent Overstreet
2017-02-07  2:49               ` Kent Overstreet
2017-02-07 17:13                 ` Mike Snitzer
2017-02-07 20:39                 ` Pavel Machek
2017-02-08  3:12                   ` Mike Galbraith
2017-02-08  4:58                   ` Kent Overstreet
2017-02-08  6:22                     ` [PATCH] block: Make rescuer threads per request_queue, not per bioset kbuild test robot
2017-02-08  6:23                     ` kbuild test robot
2017-02-08  6:57                     ` v4.9, 4.4-final: 28 bioset threads on small notebook, 36 threads on cellphone Mike Galbraith
2017-02-08 16:34                     ` Mike Snitzer
2017-02-09 21:25                       ` Kent Overstreet
2017-02-14 16:34                         ` [dm-devel] " Mikulas Patocka
2017-02-14 17:33                         ` Mike Snitzer
2017-02-08  2:47                 ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).