2.6.35 Regression: Ages spent discarding blocks that weren't used!

All of lore.kernel.org
 help / color / mirror / Atom feed

* 2.6.35 Regression: Ages spent discarding blocks that weren't used!
@ 2010-08-04  1:40 Nigel Cunningham
  2010-08-04  8:59 ` Stefan Richter
                   ` (3 more replies)
  0 siblings, 4 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-04  1:40 UTC (permalink / raw)
  To: LKML, pm list

Hi all.

I've just given hibernation a go under 2.6.35, and at first I thought 
there was some sort of hang in freezing processes. The computer sat 
there for aaaaaages, apparently doing nothing. Switched from TuxOnIce to 
swsusp to see if it was specific to my code but no - the problem was 
there too. I used the nifty new kdb support to get a backtrace, which was:

get_swap_page_of_type
discard_swap_cluster
blk_dev_issue_discard
wait_for_completion

Adding a printk in discard swap cluster gives the following:

[   46.758330] Discarding 256 pages from bdev 800003 beginning at page 
640377.
[   47.003363] Discarding 256 pages from bdev 800003 beginning at page 
640633.
[   47.246514] Discarding 256 pages from bdev 800003 beginning at page 
640889.

...

[  221.877465] Discarding 256 pages from bdev 800003 beginning at page 
826745.
[  222.121284] Discarding 256 pages from bdev 800003 beginning at page 
827001.
[  222.365908] Discarding 256 pages from bdev 800003 beginning at page 
827257.
[  222.610311] Discarding 256 pages from bdev 800003 beginning at page 
827513.

So allocating 4GB of swap on my SSD now takes 176 seconds instead of 
virtually no time at all. (This code is completely unchanged from 2.6.34).

I have a couple of questions:

1) As far as I can see, there haven't been any changes in mm/swapfile.c 
that would cause this slowdown, so something in the block layer has 
(from my point of view) regressed. Is this a known issue?

2) Why are we calling discard_swap_cluster anyway? The swap was unused 
and we're allocating it. I could understand calling it when freeing 
swap, but when allocating?

Regards,

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-04  1:40 2.6.35 Regression: Ages spent discarding blocks that weren't used! Nigel Cunningham
@ 2010-08-04  8:59 ` Stefan Richter
  2010-08-04  9:16   ` Nigel Cunningham
  2010-08-04  9:16   ` Nigel Cunningham
  2010-08-04  8:59 ` Stefan Richter
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 30+ messages in thread
From: Stefan Richter @ 2010-08-04  8:59 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: linux-kernel, linux-pm, linux-scsi

(adding Cc: linux-scsi)

Nigel Cunningham wrote:
> I've just given hibernation a go under 2.6.35, and at first I thought
> there was some sort of hang in freezing processes. The computer sat
> there for aaaaaages, apparently doing nothing. Switched from TuxOnIce to
> swsusp to see if it was specific to my code but no - the problem was
> there too. I used the nifty new kdb support to get a backtrace, which was:
> 
> get_swap_page_of_type
> discard_swap_cluster
> blk_dev_issue_discard
> wait_for_completion
> 
> Adding a printk in discard swap cluster gives the following:
> 
> [   46.758330] Discarding 256 pages from bdev 800003 beginning at page 640377.
> [   47.003363] Discarding 256 pages from bdev 800003 beginning at page 640633.
> [   47.246514] Discarding 256 pages from bdev 800003 beginning at page 640889.
> 
> ...
> 
> [  221.877465] Discarding 256 pages from bdev 800003 beginning at page 826745.
> [  222.121284] Discarding 256 pages from bdev 800003 beginning at page 827001.
> [  222.365908] Discarding 256 pages from bdev 800003 beginning at page 827257.
> [  222.610311] Discarding 256 pages from bdev 800003 beginning at page 827513.
> 
> So allocating 4GB of swap on my SSD now takes 176 seconds instead of
> virtually no time at all. (This code is completely unchanged from 2.6.34).
> 
> I have a couple of questions:
> 
> 1) As far as I can see, there haven't been any changes in mm/swapfile.c
> that would cause this slowdown, so something in the block layer has
> (from my point of view) regressed. Is this a known issue?

Perhaps ATA TRIM is enabled for this SSD in 2.6.35 but not in 2.6.34?
Or the discard code has been changed to issue many moderately sized ATA
TRIMs instead of a single huge one, and the former was much more optimal
for your particular SSD?

> 2) Why are we calling discard_swap_cluster anyway? The swap was unused
> and we're allocating it. I could understand calling it when freeing
> swap, but when allocating?

At the moment when the administrator creates swap space, the kernel can
assume that he has no use anymore for the data that may have existed
previously at this space.  Hence instruct the SSD's flash translation
layer to return all these blocks to the list of unused logical blocks
which do not have to be read and backed up whenever another logical
block within the same erase block is written to.

However, I am surprised that this is done every time (?) when preparing
for hibernation.
-- 
Stefan Richter
-=====-==-=- =--- --=--
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-04  8:59 ` Stefan Richter
@ 2010-08-04  9:16   ` Nigel Cunningham
  2010-08-04  9:16   ` Nigel Cunningham
  1 sibling, 0 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-04  9:16 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux-kernel, linux-pm, linux-scsi

Hi.

On 04/08/10 18:59, Stefan Richter wrote:
> (adding Cc: linux-scsi)
>
> Nigel Cunningham wrote:
>> I've just given hibernation a go under 2.6.35, and at first I thought
>> there was some sort of hang in freezing processes. The computer sat
>> there for aaaaaages, apparently doing nothing. Switched from TuxOnIce to
>> swsusp to see if it was specific to my code but no - the problem was
>> there too. I used the nifty new kdb support to get a backtrace, which was:
>>
>> get_swap_page_of_type
>> discard_swap_cluster
>> blk_dev_issue_discard
>> wait_for_completion
>>
>> Adding a printk in discard swap cluster gives the following:
>>
>> [   46.758330] Discarding 256 pages from bdev 800003 beginning at page 640377.
>> [   47.003363] Discarding 256 pages from bdev 800003 beginning at page 640633.
>> [   47.246514] Discarding 256 pages from bdev 800003 beginning at page 640889.
>>
>> ...
>>
>> [  221.877465] Discarding 256 pages from bdev 800003 beginning at page 826745.
>> [  222.121284] Discarding 256 pages from bdev 800003 beginning at page 827001.
>> [  222.365908] Discarding 256 pages from bdev 800003 beginning at page 827257.
>> [  222.610311] Discarding 256 pages from bdev 800003 beginning at page 827513.
>>
>> So allocating 4GB of swap on my SSD now takes 176 seconds instead of
>> virtually no time at all. (This code is completely unchanged from 2.6.34).
>>
>> I have a couple of questions:
>>
>> 1) As far as I can see, there haven't been any changes in mm/swapfile.c
>> that would cause this slowdown, so something in the block layer has
>> (from my point of view) regressed. Is this a known issue?
>
> Perhaps ATA TRIM is enabled for this SSD in 2.6.35 but not in 2.6.34?
> Or the discard code has been changed to issue many moderately sized ATA
> TRIMs instead of a single huge one, and the former was much more optimal
> for your particular SSD?

Mmmm. Wonder how I tell. Something in dmesg or hdparm -I?

ata3.00: ATA-8: ARSSD56GBP, 1916, max UDMA/133
ata3.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
ata3.00: configured for UDMA/133
scsi 2:0:0:0: Direct-Access     ATA      ARSSD56GBP       1916 PQ: 0 ANSI: 5
sd 2:0:0:0: Attached scsi generic sg1 type 0
sd 2:0:0:0: [sda] 500118192 512-byte logical blocks: (256 GB/238 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't 
support DPO or FUA
sda: sda1 sda2 sda3 sda4
sd 2:0:0:0: [sda] Attached SCSI disk

/dev/sda:

ATA device, with non-removable media
	Model Number:       ARSSD56GBP
	Serial Number:      DC2210200F1B40015
	Firmware Revision:  1916
Standards:
	Supported: 8 7 6 5
	Likely used: 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  500118192
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	device size with M = 1024*1024:      244198 MBytes
	device size with M = 1000*1000:      256060 MBytes (256 GB)
	cache/buffer size  = unknown
	Nominal Media Rotation Rate: Solid State Device
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 1	Current = 1
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	DOWNLOAD_MICROCODE
	    	SET_MAX security extension
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	   *	DMA Setup Auto-Activate optimization
	    	Device-initiated interface power management
	   *	Software settings preservation
	   *	Data Set Management determinate TRIM supported
Security:
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
	not	supported: enhanced erase
Checksum: correct


>> 2) Why are we calling discard_swap_cluster anyway? The swap was unused
>> and we're allocating it. I could understand calling it when freeing
>> swap, but when allocating?
>
> At the moment when the administrator creates swap space, the kernel can
> assume that he has no use anymore for the data that may have existed
> previously at this space.  Hence instruct the SSD's flash translation
> layer to return all these blocks to the list of unused logical blocks
> which do not have to be read and backed up whenever another logical
> block within the same erase block is written to.
>
> However, I am surprised that this is done every time (?) when preparing
> for hibernation.

It's not hibernation per se. The discard code is called from a few 
places in swapfile.c in (afaict from a quick scan) both swap allocation 
and free paths.

Regards,

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-04  8:59 ` Stefan Richter
  2010-08-04  9:16   ` Nigel Cunningham
@ 2010-08-04  9:16   ` Nigel Cunningham
  1 sibling, 0 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-04  9:16 UTC (permalink / raw)
  To: Stefan Richter; +Cc: linux-pm, linux-kernel, linux-scsi

Hi.

On 04/08/10 18:59, Stefan Richter wrote:
> (adding Cc: linux-scsi)
>
> Nigel Cunningham wrote:
>> I've just given hibernation a go under 2.6.35, and at first I thought
>> there was some sort of hang in freezing processes. The computer sat
>> there for aaaaaages, apparently doing nothing. Switched from TuxOnIce to
>> swsusp to see if it was specific to my code but no - the problem was
>> there too. I used the nifty new kdb support to get a backtrace, which was:
>>
>> get_swap_page_of_type
>> discard_swap_cluster
>> blk_dev_issue_discard
>> wait_for_completion
>>
>> Adding a printk in discard swap cluster gives the following:
>>
>> [   46.758330] Discarding 256 pages from bdev 800003 beginning at page 640377.
>> [   47.003363] Discarding 256 pages from bdev 800003 beginning at page 640633.
>> [   47.246514] Discarding 256 pages from bdev 800003 beginning at page 640889.
>>
>> ...
>>
>> [  221.877465] Discarding 256 pages from bdev 800003 beginning at page 826745.
>> [  222.121284] Discarding 256 pages from bdev 800003 beginning at page 827001.
>> [  222.365908] Discarding 256 pages from bdev 800003 beginning at page 827257.
>> [  222.610311] Discarding 256 pages from bdev 800003 beginning at page 827513.
>>
>> So allocating 4GB of swap on my SSD now takes 176 seconds instead of
>> virtually no time at all. (This code is completely unchanged from 2.6.34).
>>
>> I have a couple of questions:
>>
>> 1) As far as I can see, there haven't been any changes in mm/swapfile.c
>> that would cause this slowdown, so something in the block layer has
>> (from my point of view) regressed. Is this a known issue?
>
> Perhaps ATA TRIM is enabled for this SSD in 2.6.35 but not in 2.6.34?
> Or the discard code has been changed to issue many moderately sized ATA
> TRIMs instead of a single huge one, and the former was much more optimal
> for your particular SSD?

Mmmm. Wonder how I tell. Something in dmesg or hdparm -I?

ata3.00: ATA-8: ARSSD56GBP, 1916, max UDMA/133
ata3.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
ata3.00: configured for UDMA/133
scsi 2:0:0:0: Direct-Access     ATA      ARSSD56GBP       1916 PQ: 0 ANSI: 5
sd 2:0:0:0: Attached scsi generic sg1 type 0
sd 2:0:0:0: [sda] 500118192 512-byte logical blocks: (256 GB/238 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't 
support DPO or FUA
sda: sda1 sda2 sda3 sda4
sd 2:0:0:0: [sda] Attached SCSI disk

/dev/sda:

ATA device, with non-removable media
	Model Number:       ARSSD56GBP
	Serial Number:      DC2210200F1B40015
	Firmware Revision:  1916
Standards:
	Supported: 8 7 6 5
	Likely used: 8
Configuration:
	Logical		max	current
	cylinders	16383	16383
	heads		16	16
	sectors/track	63	63
	--
	CHS current addressable sectors:   16514064
	LBA    user addressable sectors:  268435455
	LBA48  user addressable sectors:  500118192
	Logical  Sector size:                   512 bytes
	Physical Sector size:                   512 bytes
	device size with M = 1024*1024:      244198 MBytes
	device size with M = 1000*1000:      256060 MBytes (256 GB)
	cache/buffer size  = unknown
	Nominal Media Rotation Rate: Solid State Device
Capabilities:
	LBA, IORDY(can be disabled)
	Queue depth: 32
	Standby timer values: spec'd by Standard, no device specific minimum
	R/W multiple sector transfer: Max = 1	Current = 1
	DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
	     Cycle time: min=120ns recommended=120ns
	PIO: pio0 pio1 pio2 pio3 pio4
	     Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
	Enabled	Supported:
	   *	SMART feature set
	    	Security Mode feature set
	   *	Power Management feature set
	   *	Write cache
	   *	Look-ahead
	   *	Host Protected Area feature set
	   *	WRITE_BUFFER command
	   *	READ_BUFFER command
	   *	DOWNLOAD_MICROCODE
	    	SET_MAX security extension
	   *	48-bit Address feature set
	   *	Device Configuration Overlay feature set
	   *	Mandatory FLUSH_CACHE
	   *	FLUSH_CACHE_EXT
	   *	SMART self-test
	   *	General Purpose Logging feature set
	   *	Gen1 signaling speed (1.5Gb/s)
	   *	Gen2 signaling speed (3.0Gb/s)
	   *	Native Command Queueing (NCQ)
	   *	Phy event counters
	   *	DMA Setup Auto-Activate optimization
	    	Device-initiated interface power management
	   *	Software settings preservation
	   *	Data Set Management determinate TRIM supported
Security:
		supported
	not	enabled
	not	locked
		frozen
	not	expired: security count
	not	supported: enhanced erase
Checksum: correct


>> 2) Why are we calling discard_swap_cluster anyway? The swap was unused
>> and we're allocating it. I could understand calling it when freeing
>> swap, but when allocating?
>
> At the moment when the administrator creates swap space, the kernel can
> assume that he has no use anymore for the data that may have existed
> previously at this space.  Hence instruct the SSD's flash translation
> layer to return all these blocks to the list of unused logical blocks
> which do not have to be read and backed up whenever another logical
> block within the same erase block is written to.
>
> However, I am surprised that this is done every time (?) when preparing
> for hibernation.

It's not hibernation per se. The discard code is called from a few 
places in swapfile.c in (afaict from a quick scan) both swap allocation 
and free paths.

Regards,

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-04  1:40 2.6.35 Regression: Ages spent discarding blocks that weren't used! Nigel Cunningham
  2010-08-04  8:59 ` Stefan Richter
@ 2010-08-04  8:59 ` Stefan Richter
  2010-08-04 12:44 ` Mark Lord
  2010-08-04 12:44 ` Mark Lord
  3 siblings, 0 replies; 30+ messages in thread
From: Stefan Richter @ 2010-08-04  8:59 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: linux-pm, linux-kernel, linux-scsi

(adding Cc: linux-scsi)

Nigel Cunningham wrote:
> I've just given hibernation a go under 2.6.35, and at first I thought
> there was some sort of hang in freezing processes. The computer sat
> there for aaaaaages, apparently doing nothing. Switched from TuxOnIce to
> swsusp to see if it was specific to my code but no - the problem was
> there too. I used the nifty new kdb support to get a backtrace, which was:
> 
> get_swap_page_of_type
> discard_swap_cluster
> blk_dev_issue_discard
> wait_for_completion
> 
> Adding a printk in discard swap cluster gives the following:
> 
> [   46.758330] Discarding 256 pages from bdev 800003 beginning at page 640377.
> [   47.003363] Discarding 256 pages from bdev 800003 beginning at page 640633.
> [   47.246514] Discarding 256 pages from bdev 800003 beginning at page 640889.
> 
> ...
> 
> [  221.877465] Discarding 256 pages from bdev 800003 beginning at page 826745.
> [  222.121284] Discarding 256 pages from bdev 800003 beginning at page 827001.
> [  222.365908] Discarding 256 pages from bdev 800003 beginning at page 827257.
> [  222.610311] Discarding 256 pages from bdev 800003 beginning at page 827513.
> 
> So allocating 4GB of swap on my SSD now takes 176 seconds instead of
> virtually no time at all. (This code is completely unchanged from 2.6.34).
> 
> I have a couple of questions:
> 
> 1) As far as I can see, there haven't been any changes in mm/swapfile.c
> that would cause this slowdown, so something in the block layer has
> (from my point of view) regressed. Is this a known issue?

Perhaps ATA TRIM is enabled for this SSD in 2.6.35 but not in 2.6.34?
Or the discard code has been changed to issue many moderately sized ATA
TRIMs instead of a single huge one, and the former was much more optimal
for your particular SSD?

> 2) Why are we calling discard_swap_cluster anyway? The swap was unused
> and we're allocating it. I could understand calling it when freeing
> swap, but when allocating?

At the moment when the administrator creates swap space, the kernel can
assume that he has no use anymore for the data that may have existed
previously at this space.  Hence instruct the SSD's flash translation
layer to return all these blocks to the list of unused logical blocks
which do not have to be read and backed up whenever another logical
block within the same erase block is written to.

However, I am surprised that this is done every time (?) when preparing
for hibernation.
-- 
Stefan Richter
-=====-==-=- =--- --=--
http://arcgraph.de/sr/

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-04  1:40 2.6.35 Regression: Ages spent discarding blocks that weren't used! Nigel Cunningham
  2010-08-04  8:59 ` Stefan Richter
  2010-08-04  8:59 ` Stefan Richter
@ 2010-08-04 12:44 ` Mark Lord
  2010-08-04 12:44 ` Mark Lord
  3 siblings, 0 replies; 30+ messages in thread
From: Mark Lord @ 2010-08-04 12:44 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: pm list, LKML

Looks to me like more and more things are using the block discard
functionality, and as predicted it is slowing things down enormously.

The problem is that we still only discard tiny bits (a single range still??)
per TRIM command, rather than batching larger ranges and larger numbers
of ranges into single TRIM commands.

That's a very poor implementation, especially when things start enabling
it by default.  Eg. the swap code, mke2fs, etc..

Ugh.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-04  1:40 2.6.35 Regression: Ages spent discarding blocks that weren't used! Nigel Cunningham
                   ` (2 preceding siblings ...)
  2010-08-04 12:44 ` Mark Lord
@ 2010-08-04 12:44 ` Mark Lord
  2010-08-04 18:02   ` Martin K. Petersen
                     ` (3 more replies)
  3 siblings, 4 replies; 30+ messages in thread
From: Mark Lord @ 2010-08-04 12:44 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: LKML, pm list

Looks to me like more and more things are using the block discard
functionality, and as predicted it is slowing things down enormously.

The problem is that we still only discard tiny bits (a single range still??)
per TRIM command, rather than batching larger ranges and larger numbers
of ranges into single TRIM commands.

That's a very poor implementation, especially when things start enabling
it by default.  Eg. the swap code, mke2fs, etc..

Ugh.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-04 12:44 ` Mark Lord
@ 2010-08-04 18:02   ` Martin K. Petersen
  2010-08-04 18:02   ` Martin K. Petersen
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Martin K. Petersen @ 2010-08-04 18:02 UTC (permalink / raw)
  To: Mark Lord; +Cc: pm list, LKML

>>>>> "Mark" == Mark Lord <kernel@teksavvy.com> writes:

Mark> Looks to me like more and more things are using the block discard
Mark> functionality, and as predicted it is slowing things down
Mark> enormously.

Mark> The problem is that we still only discard tiny bits (a single
Mark> range still??)  per TRIM command, rather than batching larger
Mark> ranges and larger numbers of ranges into single TRIM commands.

Mark> That's a very poor implementation, especially when things start
Mark> enabling it by default.  Eg. the swap code, mke2fs, etc..

I'm working on aggregation.  But it's harder than we initially
thought...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-04 12:44 ` Mark Lord
  2010-08-04 18:02   ` Martin K. Petersen
@ 2010-08-04 18:02   ` Martin K. Petersen
  2010-08-04 21:22   ` Nigel Cunningham
  2010-08-04 21:22   ` Nigel Cunningham
  3 siblings, 0 replies; 30+ messages in thread
From: Martin K. Petersen @ 2010-08-04 18:02 UTC (permalink / raw)
  To: Mark Lord; +Cc: Nigel Cunningham, LKML, pm list

>>>>> "Mark" == Mark Lord <kernel@teksavvy.com> writes:

Mark> Looks to me like more and more things are using the block discard
Mark> functionality, and as predicted it is slowing things down
Mark> enormously.

Mark> The problem is that we still only discard tiny bits (a single
Mark> range still??)  per TRIM command, rather than batching larger
Mark> ranges and larger numbers of ranges into single TRIM commands.

Mark> That's a very poor implementation, especially when things start
Mark> enabling it by default.  Eg. the swap code, mke2fs, etc..

I'm working on aggregation.  But it's harder than we initially
thought...

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-04 12:44 ` Mark Lord
  2010-08-04 18:02   ` Martin K. Petersen
  2010-08-04 18:02   ` Martin K. Petersen
@ 2010-08-04 21:22   ` Nigel Cunningham
  2010-08-04 21:22   ` Nigel Cunningham
  3 siblings, 0 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-04 21:22 UTC (permalink / raw)
  To: Mark Lord; +Cc: pm list, LKML

Hi.

On 04/08/10 22:44, Mark Lord wrote:
> Looks to me like more and more things are using the block discard
> functionality, and as predicted it is slowing things down enormously.
>
> The problem is that we still only discard tiny bits (a single range
> still??)
> per TRIM command, rather than batching larger ranges and larger numbers
> of ranges into single TRIM commands.
>
> That's a very poor implementation, especially when things start enabling
> it by default. Eg. the swap code, mke2fs, etc..
>
> Ugh.

I was hoping for a nice quick and simple answer. Since I haven't got 
one, I'll try to find time to do a git bisect. I think I'll also look at 
the swap code more carefully and see if it's doing the sensible thing. I 
can't (at the moment) see the logic behind calling discard when 
allocating swap. At freeing time makes much more sense to me.

Regards,

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-04 12:44 ` Mark Lord
                     ` (2 preceding siblings ...)
  2010-08-04 21:22   ` Nigel Cunningham
@ 2010-08-04 21:22   ` Nigel Cunningham
  2010-08-05  3:58     ` Hugh Dickins
  2010-08-05  3:58     ` Hugh Dickins
  3 siblings, 2 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-04 21:22 UTC (permalink / raw)
  To: Mark Lord; +Cc: LKML, pm list

Hi.

On 04/08/10 22:44, Mark Lord wrote:
> Looks to me like more and more things are using the block discard
> functionality, and as predicted it is slowing things down enormously.
>
> The problem is that we still only discard tiny bits (a single range
> still??)
> per TRIM command, rather than batching larger ranges and larger numbers
> of ranges into single TRIM commands.
>
> That's a very poor implementation, especially when things start enabling
> it by default. Eg. the swap code, mke2fs, etc..
>
> Ugh.

I was hoping for a nice quick and simple answer. Since I haven't got 
one, I'll try to find time to do a git bisect. I think I'll also look at 
the swap code more carefully and see if it's doing the sensible thing. I 
can't (at the moment) see the logic behind calling discard when 
allocating swap. At freeing time makes much more sense to me.

Regards,

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't  used!
  2010-08-04 21:22   ` Nigel Cunningham
@ 2010-08-05  3:58     ` Hugh Dickins
  2010-08-05  6:28       ` Nigel Cunningham
  2010-08-05  6:28       ` Nigel Cunningham
  2010-08-05  3:58     ` Hugh Dickins
  1 sibling, 2 replies; 30+ messages in thread
From: Hugh Dickins @ 2010-08-05  3:58 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Mark Lord, LKML, pm list

On Wed, Aug 4, 2010 at 2:22 PM, Nigel Cunningham <nigel@tuxonice.net> wrote:
> On 04/08/10 22:44, Mark Lord wrote:
>>
>> Looks to me like more and more things are using the block discard
>> functionality, and as predicted it is slowing things down enormously.
>>
>> The problem is that we still only discard tiny bits (a single range
>> still??)
>> per TRIM command, rather than batching larger ranges and larger numbers
>> of ranges into single TRIM commands.
>>
>> That's a very poor implementation, especially when things start enabling
>> it by default. Eg. the swap code, mke2fs, etc..
>>
>> Ugh.

swap has been discarding since 2.6.29, on one 1MB range at a time.
There's been no significant change at the swap end since then, but I
guess more devices have been announcing themselves as nonrotational
and supporting discard, and the implementation lower down has gone
through a number of changes.

>
> I was hoping for a nice quick and simple answer. Since I haven't got one,
> I'll try to find time to do a git bisect. I think I'll also look at the swap
> code more carefully and see if it's doing the sensible thing. I can't (at
> the moment) see the logic behind calling discard when allocating swap. At
> freeing time makes much more sense to me.

I agree it would make more sense to discard swap when freeing rather
than when allocating, I wish we could.  But at the freeing point we're
often holding a page_table spinlock at an outer level, and it's just
one page we're given to free.  Freeing is an operation you want to be
comfortable doing when you're short of resources, whereas discard is a
kind of I/O operation which needs resources.

It happens that in the allocation path, there was already a place at
which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
when I say 1MB), so that was the neatest point at which to site the
discard - though even there we have to be careful about racing
allocations.

I did once try to go back and get it to work when freeing instead of
allocating, gathering the swap slots up then freeing when convenient.
It was messy, didn't work very well, and didn't show an improvement in
performance (on what we were testing at the time).

I've not been able to test swap, with SSDs, for several months: that's
a dreadful regression you've found, thanks a lot for reporting it:
I'll be very interested to hear where you locate the cause.  If it
needs changes to the way swap does discard, so be it.

Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-05  3:58     ` Hugh Dickins
@ 2010-08-05  6:28       ` Nigel Cunningham
  2010-08-06  1:15         ` Hugh Dickins
  2010-08-06  1:15         ` Hugh Dickins
  2010-08-05  6:28       ` Nigel Cunningham
  1 sibling, 2 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-05  6:28 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Mark Lord, LKML, pm list

Hi Hugh.

Thanks for the email.

On 05/08/10 13:58, Hugh Dickins wrote:
> On Wed, Aug 4, 2010 at 2:22 PM, Nigel Cunningham<nigel@tuxonice.net>  wrote:
>> On 04/08/10 22:44, Mark Lord wrote:
>>>
>>> Looks to me like more and more things are using the block discard
>>> functionality, and as predicted it is slowing things down enormously.
>>>
>>> The problem is that we still only discard tiny bits (a single range
>>> still??)
>>> per TRIM command, rather than batching larger ranges and larger numbers
>>> of ranges into single TRIM commands.
>>>
>>> That's a very poor implementation, especially when things start enabling
>>> it by default. Eg. the swap code, mke2fs, etc..
>>>
>>> Ugh.
>
> swap has been discarding since 2.6.29, on one 1MB range at a time.
> There's been no significant change at the swap end since then, but I
> guess more devices have been announcing themselves as nonrotational
> and supporting discard, and the implementation lower down has gone
> through a number of changes.

Okay; that's good to know.

>>
>> I was hoping for a nice quick and simple answer. Since I haven't got one,
>> I'll try to find time to do a git bisect. I think I'll also look at the swap
>> code more carefully and see if it's doing the sensible thing. I can't (at
>> the moment) see the logic behind calling discard when allocating swap. At
>> freeing time makes much more sense to me.
>
> I agree it would make more sense to discard swap when freeing rather
> than when allocating, I wish we could.  But at the freeing point we're
> often holding a page_table spinlock at an outer level, and it's just
> one page we're given to free.  Freeing is an operation you want to be
> comfortable doing when you're short of resources, whereas discard is a
> kind of I/O operation which needs resources.
>
> It happens that in the allocation path, there was already a place at
> which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
> when I say 1MB), so that was the neatest point at which to site the
> discard - though even there we have to be careful about racing
> allocations.

Makes sense when you put it like that :)

I know it's a bit messier, but would it be possible for us to modify the 
behaviour depending on the reason for the allocation? (No page_table 
spinlock holding when we're hibernating).

The issue isn't as noticable with [u]swsusp at the moment because 
they're allocating swap as the image is being written. If my current set 
of patches for Rafael get accepted, that will change (swap will be 
preallocated).

TuxOnIce always allocates all available storage since there's (usually) 
virtually zero cost of doing so and it then doesn't matter how much the 
drivers allocate when we do the atomic copy, or how good a compression 
ratio is achieved. That's what I'm aiming for in my patches for [u]swsusp.

> I did once try to go back and get it to work when freeing instead of
> allocating, gathering the swap slots up then freeing when convenient.
> It was messy, didn't work very well, and didn't show an improvement in
> performance (on what we were testing at the time).

For one or two at a time, I can see that would be the case. If it is 
possible to do the discard of pages used for hibernation after we're 
finished reading the image, that would be good. Even better would be to 
only do the discard for pages that were actually used and just do a 
simple free for ones that were only allocated.

Of course I'm talking in ideals without having an intimate knowledge of 
the swap allocation code or exactly how ugly the above would make it :)

> I've not been able to test swap, with SSDs, for several months: that's
> a dreadful regression you've found, thanks a lot for reporting it:
> I'll be very interested to hear where you locate the cause.  If it
> needs changes to the way swap does discard, so be it.

I'm traveling to the US on Saturday and have apparently been given one 
of those nice seats with power, so I'll try and get the bisection done then.

TTFN

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-05  6:28       ` Nigel Cunningham
@ 2010-08-06  1:15         ` Hugh Dickins
  2010-08-06  1:15         ` Hugh Dickins
  1 sibling, 0 replies; 30+ messages in thread
From: Hugh Dickins @ 2010-08-06  1:15 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Mark Lord, LKML, pm list

On Wed, Aug 4, 2010 at 11:28 PM, Nigel Cunningham <nigel@tuxonice.net> wrote:
> On 05/08/10 13:58, Hugh Dickins wrote:
>> I agree it would make more sense to discard swap when freeing rather
>> than when allocating, I wish we could.  But at the freeing point we're
>> often holding a page_table spinlock at an outer level, and it's just
>> one page we're given to free.  Freeing is an operation you want to be
>> comfortable doing when you're short of resources, whereas discard is a
>> kind of I/O operation which needs resources.
>>
>> It happens that in the allocation path, there was already a place at
>> which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
>> when I say 1MB), so that was the neatest point at which to site the
>> discard - though even there we have to be careful about racing
>> allocations.
>
> Makes sense when you put it like that :)
>
> I know it's a bit messier, but would it be possible for us to modify the
> behaviour depending on the reason for the allocation? (No page_table
> spinlock holding when we're hibernating).

But if we moved it to the swap free, it would occur in the swap free
(if any) _prior_ to hibernating, when testing for hibernation would
just say "no".

>> I did once try to go back and get it to work when freeing instead of
>> allocating, gathering the swap slots up then freeing when convenient.
>> It was messy, didn't work very well, and didn't show an improvement in
>> performance (on what we were testing at the time).

When I tried gathering together the frees, there were just too many
short extents to make the discards worth doing that  way.

>
> For one or two at a time, I can see that would be the case. If it is
> possible to do the discard of pages used for hibernation after we're
> finished reading the image, that would be good. Even better would be to only
> do the discard for pages that were actually used and just do a simple free
> for ones that were only allocated.

There are optimizations which could be done e.g. we discard the whole
of swap at swapon time, then re-discard each 1MB as we begin to
allocate from it.  Clearly that has a certain stupidity to it!  But
the initial discard of the whole of swap should be efficient and worth
doing.  We could keep track of which swap pages have already been
discarded since last used, but that would take up another... it's not
immediately clear to me whether it's another value or another bit of
the swap count.

We could provide an interface for hibernation, to do a minimal number
of maximal range discards to suit hibernation (but we need to be very
careful about ongoing allocation, as in another thread).

But are these things worth doing?  I think not.  Discard is supposed
to be helping not hindering: if the device only supports discard in a
way that's so inefficient, we'd do better to blacklist it as not
supporting discard at all.  My suspicion is that your SSD is of that
kind: that it used not to be recognized as supporting discard, but now
in 2.6.35 it is so recognized.  However, that's just a suspicion: let
me not slander your SSD when it may be my code or someone else's to
blame: needs testing.

>
> Of course I'm talking in ideals without having an intimate knowledge of the
> swap allocation code or exactly how ugly the above would make it :)
>
>> I've not been able to test swap, with SSDs, for several months: that's
>> a dreadful regression you've found, thanks a lot for reporting it:
>> I'll be very interested to hear where you locate the cause.  If it
>> needs changes to the way swap does discard, so be it.
>
> I'm traveling to the US on Saturday and have apparently been given one of
> those nice seats with power, so I'll try and get the bisection done then.

That would be helpful, but displays greater dedication than I'd offer myself!

Hugh
_______________________________________________
linux-pm mailing list
linux-pm@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/linux-pm

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't  used!
  2010-08-05  6:28       ` Nigel Cunningham
  2010-08-06  1:15         ` Hugh Dickins
@ 2010-08-06  1:15         ` Hugh Dickins
  2010-08-06  4:40           ` Nigel Cunningham
  2010-08-06  4:40           ` Nigel Cunningham
  1 sibling, 2 replies; 30+ messages in thread
From: Hugh Dickins @ 2010-08-06  1:15 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Mark Lord, LKML, pm list

On Wed, Aug 4, 2010 at 11:28 PM, Nigel Cunningham <nigel@tuxonice.net> wrote:
> On 05/08/10 13:58, Hugh Dickins wrote:
>> I agree it would make more sense to discard swap when freeing rather
>> than when allocating, I wish we could.  But at the freeing point we're
>> often holding a page_table spinlock at an outer level, and it's just
>> one page we're given to free.  Freeing is an operation you want to be
>> comfortable doing when you're short of resources, whereas discard is a
>> kind of I/O operation which needs resources.
>>
>> It happens that in the allocation path, there was already a place at
>> which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
>> when I say 1MB), so that was the neatest point at which to site the
>> discard - though even there we have to be careful about racing
>> allocations.
>
> Makes sense when you put it like that :)
>
> I know it's a bit messier, but would it be possible for us to modify the
> behaviour depending on the reason for the allocation? (No page_table
> spinlock holding when we're hibernating).

But if we moved it to the swap free, it would occur in the swap free
(if any) _prior_ to hibernating, when testing for hibernation would
just say "no".

>> I did once try to go back and get it to work when freeing instead of
>> allocating, gathering the swap slots up then freeing when convenient.
>> It was messy, didn't work very well, and didn't show an improvement in
>> performance (on what we were testing at the time).

When I tried gathering together the frees, there were just too many
short extents to make the discards worth doing that  way.

>
> For one or two at a time, I can see that would be the case. If it is
> possible to do the discard of pages used for hibernation after we're
> finished reading the image, that would be good. Even better would be to only
> do the discard for pages that were actually used and just do a simple free
> for ones that were only allocated.

There are optimizations which could be done e.g. we discard the whole
of swap at swapon time, then re-discard each 1MB as we begin to
allocate from it.  Clearly that has a certain stupidity to it!  But
the initial discard of the whole of swap should be efficient and worth
doing.  We could keep track of which swap pages have already been
discarded since last used, but that would take up another... it's not
immediately clear to me whether it's another value or another bit of
the swap count.

We could provide an interface for hibernation, to do a minimal number
of maximal range discards to suit hibernation (but we need to be very
careful about ongoing allocation, as in another thread).

But are these things worth doing?  I think not.  Discard is supposed
to be helping not hindering: if the device only supports discard in a
way that's so inefficient, we'd do better to blacklist it as not
supporting discard at all.  My suspicion is that your SSD is of that
kind: that it used not to be recognized as supporting discard, but now
in 2.6.35 it is so recognized.  However, that's just a suspicion: let
me not slander your SSD when it may be my code or someone else's to
blame: needs testing.

>
> Of course I'm talking in ideals without having an intimate knowledge of the
> swap allocation code or exactly how ugly the above would make it :)
>
>> I've not been able to test swap, with SSDs, for several months: that's
>> a dreadful regression you've found, thanks a lot for reporting it:
>> I'll be very interested to hear where you locate the cause.  If it
>> needs changes to the way swap does discard, so be it.
>
> I'm traveling to the US on Saturday and have apparently been given one of
> those nice seats with power, so I'll try and get the bisection done then.

That would be helpful, but displays greater dedication than I'd offer myself!

Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-06  1:15         ` Hugh Dickins
@ 2010-08-06  4:40           ` Nigel Cunningham
  2010-08-06 22:07             ` Hugh Dickins
  2010-08-06 22:07             ` Hugh Dickins
  2010-08-06  4:40           ` Nigel Cunningham
  1 sibling, 2 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-06  4:40 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Mark Lord, LKML, pm list

Hi.

On 06/08/10 11:15, Hugh Dickins wrote:
> On Wed, Aug 4, 2010 at 11:28 PM, Nigel Cunningham<nigel@tuxonice.net>  wrote:
>> On 05/08/10 13:58, Hugh Dickins wrote:
>>> I agree it would make more sense to discard swap when freeing rather
>>> than when allocating, I wish we could.  But at the freeing point we're
>>> often holding a page_table spinlock at an outer level, and it's just
>>> one page we're given to free.  Freeing is an operation you want to be
>>> comfortable doing when you're short of resources, whereas discard is a
>>> kind of I/O operation which needs resources.

The more I think about this, the more it seems to me that doing the 
discard at allocation time is just wrong. How about a strategy in which 
you do the discard immediately if resources permit, or flag it as in 
need of doing (at a future swap free of that page or swap off time) if 
things are too pressured at the moment? I haven't put thought into what 
data structures could be used for that - just want to ask for now if 
you'd be happy with the idea of looking into a strategy like that.

>>> It happens that in the allocation path, there was already a place at
>>> which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
>>> when I say 1MB), so that was the neatest point at which to site the
>>> discard - though even there we have to be careful about racing
>>> allocations.
>>
>> Makes sense when you put it like that :)
>>
>> I know it's a bit messier, but would it be possible for us to modify the
>> behaviour depending on the reason for the allocation? (No page_table
>> spinlock holding when we're hibernating).
>
> But if we moved it to the swap free, it would occur in the swap free
> (if any) _prior_ to hibernating, when testing for hibernation would
> just say "no".

I see what you mean. I'm not so worried by that because if we're having 
to free swap in order to hibernate, things are going to be slow anyway 
and discards aren't going to stand out nearly so much.

I'm much more concerned by the fact that we're currently doing discards 
for swap that wasn't even being used. If I reboot, swapon and then 
hibernate without having touched swap so far, I still get the 'hit' on 
the whole 4GB. I don't think we should be storing on disk a "this hasn't 
been used since last discarded" flag, so I'm looking for other solutions.

>>> I did once try to go back and get it to work when freeing instead of
>>> allocating, gathering the swap slots up then freeing when convenient.
>>> It was messy, didn't work very well, and didn't show an improvement in
>>> performance (on what we were testing at the time).
>
> When I tried gathering together the frees, there were just too many
> short extents to make the discards worth doing that  way.

Okay.

>> For one or two at a time, I can see that would be the case. If it is
>> possible to do the discard of pages used for hibernation after we're
>> finished reading the image, that would be good. Even better would be to only
>> do the discard for pages that were actually used and just do a simple free
>> for ones that were only allocated.
>
> There are optimizations which could be done e.g. we discard the whole
> of swap at swapon time, then re-discard each 1MB as we begin to
> allocate from it.  Clearly that has a certain stupidity to it!  But
> the initial discard of the whole of swap should be efficient and worth
> doing.  We could keep track of which swap pages have already been
> discarded since last used, but that would take up another... it's not
> immediately clear to me whether it's another value or another bit of
> the swap count.
>
> We could provide an interface for hibernation, to do a minimal number
> of maximal range discards to suit hibernation (but we need to be very
> careful about ongoing allocation, as in another thread).
>
> But are these things worth doing?  I think not.  Discard is supposed
> to be helping not hindering: if the device only supports discard in a
> way that's so inefficient, we'd do better to blacklist it as not
> supporting discard at all.  My suspicion is that your SSD is of that
> kind: that it used not to be recognized as supporting discard, but now
> in 2.6.35 it is so recognized.  However, that's just a suspicion: let
> me not slander your SSD when it may be my code or someone else's to
> blame: needs testing.

I've done the bisect now - spent the time today instead of on the place, 
and it took me to fbbf055692aeb25c54c49d9ca84532de836fbba0. This is 
Christoph's Hellwig's patch "block: fix DISCARD_BARRIER requests".

Could the problem be that we're using DISCARD_FL_BARRIER too much? (I'm 
still looking at the code here, so am writing without having thought 
everything through _really_ carefully.

>> Of course I'm talking in ideals without having an intimate knowledge of the
>> swap allocation code or exactly how ugly the above would make it :)
>>
>>> I've not been able to test swap, with SSDs, for several months: that's
>>> a dreadful regression you've found, thanks a lot for reporting it:
>>> I'll be very interested to hear where you locate the cause.  If it
>>> needs changes to the way swap does discard, so be it.
>>
>> I'm traveling to the US on Saturday and have apparently been given one of
>> those nice seats with power, so I'll try and get the bisection done then.
>
> That would be helpful, but displays greater dedication than I'd offer myself!

Done without an airplane now. I don't have to be credited with too much 
dedication! :)

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-06  4:40           ` Nigel Cunningham
@ 2010-08-06 22:07             ` Hugh Dickins
  2010-08-06 22:07             ` Hugh Dickins
  1 sibling, 0 replies; 30+ messages in thread
From: Hugh Dickins @ 2010-08-06 22:07 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Christoph Hellwig, Mark Lord, LKML, pm list

On Thu, Aug 5, 2010 at 9:40 PM, Nigel Cunningham <nigel@tuxonice.net> wrote:
> On 06/08/10 11:15, Hugh Dickins wrote:
>> On Wed, Aug 4, 2010 at 11:28 PM, Nigel Cunningham<nigel@tuxonice.net>
>>  wrote:
>>>>> On 05/08/10 13:58, Hugh Dickins wrote:
>>>>
>>>> I agree it would make more sense to discard swap when freeing rather
>>>> than when allocating, I wish we could.  But at the freeing point we're
>>>> often holding a page_table spinlock at an outer level, and it's just
>>>> one page we're given to free.  Freeing is an operation you want to be
>>>> comfortable doing when you're short of resources, whereas discard is a
>>>> kind of I/O operation which needs resources.
>
> The more I think about this, the more it seems to me that doing the discard
> at allocation time is just wrong. How about a strategy in which you do the
> discard immediately if resources permit, or flag it as in need of doing (at
> a future swap free of that page or swap off time) if things are too
> pressured at the moment? I haven't put thought into what data structures
> could be used for that - just want to ask for now if you'd be happy with the
> idea of looking into a strategy like that.

We're going round in circles here.  I have already agreed with you
that doing it at swap free time seems more natural, but explained that
there are implementation difficulties there, so doing it at allocation
time proved both much less messy and more efficient.  I can imagine
advances that would sway me to putting in more effort there
(duplicating the scan for a free 1MB every time a page of swap is
freed? doesn't sound efficient to me, but if it saves us from an
inefficiency of too many or too late discards, perhaps it could work
out).  However, a recently introduced regression does not make a
strong case for adding in hacks - not yet anyway.

> I've done the bisect now - spent the time today instead of on the place, and

That's great, many thanks!

> it took me to fbbf055692aeb25c54c49d9ca84532de836fbba0. This is Christoph's
> Hellwig's patch "block: fix DISCARD_BARRIER requests".

That's
    block: fix DISCARD_BARRIER requests

    Filesystems assume that DISCARD_BARRIER are full barriers, so that they
    don't have to track in-progress discard operation when submitting new I/O.
    But currently we only treat them as elevator barriers, which don't
    actually do the nessecary queue drains.

where the discard barrier changes from REQ_SOFTBARRIER to REQ_HARDBARRIER.

If REQ_SOFTBARRIER means that the device is still free to reorder a
write, which was issued after discard completion was reported, before
the discard (so later discarding the data written), then certainly I
agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
unavoidable there; but if not, then it's not needed for the swap case.
 I hope to gain a little more enlightenment on such barriers shortly.

What does seem over the top to me, is for mm/swapfile.c's
blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
BLKDEV_IFL_BARRIER: those swap discards were originally written just
to use barriers, without needing to wait for completion in there.  I'd
be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
swap discards behave acceptably again for you - but understand that
you won't have a chance to try that until later next week.

Thanks,
Hugh
_______________________________________________
linux-pm mailing list
linux-pm@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/linux-pm

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't  used!
  2010-08-06  4:40           ` Nigel Cunningham
  2010-08-06 22:07             ` Hugh Dickins
@ 2010-08-06 22:07             ` Hugh Dickins
  2010-08-07 22:47               ` Nigel Cunningham
                                 ` (3 more replies)
  1 sibling, 4 replies; 30+ messages in thread
From: Hugh Dickins @ 2010-08-06 22:07 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Mark Lord, LKML, pm list, Christoph Hellwig

On Thu, Aug 5, 2010 at 9:40 PM, Nigel Cunningham <nigel@tuxonice.net> wrote:
> On 06/08/10 11:15, Hugh Dickins wrote:
>> On Wed, Aug 4, 2010 at 11:28 PM, Nigel Cunningham<nigel@tuxonice.net>
>>  wrote:
>>>>> On 05/08/10 13:58, Hugh Dickins wrote:
>>>>
>>>> I agree it would make more sense to discard swap when freeing rather
>>>> than when allocating, I wish we could.  But at the freeing point we're
>>>> often holding a page_table spinlock at an outer level, and it's just
>>>> one page we're given to free.  Freeing is an operation you want to be
>>>> comfortable doing when you're short of resources, whereas discard is a
>>>> kind of I/O operation which needs resources.
>
> The more I think about this, the more it seems to me that doing the discard
> at allocation time is just wrong. How about a strategy in which you do the
> discard immediately if resources permit, or flag it as in need of doing (at
> a future swap free of that page or swap off time) if things are too
> pressured at the moment? I haven't put thought into what data structures
> could be used for that - just want to ask for now if you'd be happy with the
> idea of looking into a strategy like that.

We're going round in circles here.  I have already agreed with you
that doing it at swap free time seems more natural, but explained that
there are implementation difficulties there, so doing it at allocation
time proved both much less messy and more efficient.  I can imagine
advances that would sway me to putting in more effort there
(duplicating the scan for a free 1MB every time a page of swap is
freed? doesn't sound efficient to me, but if it saves us from an
inefficiency of too many or too late discards, perhaps it could work
out).  However, a recently introduced regression does not make a
strong case for adding in hacks - not yet anyway.

> I've done the bisect now - spent the time today instead of on the place, and

That's great, many thanks!

> it took me to fbbf055692aeb25c54c49d9ca84532de836fbba0. This is Christoph's
> Hellwig's patch "block: fix DISCARD_BARRIER requests".

That's
    block: fix DISCARD_BARRIER requests

    Filesystems assume that DISCARD_BARRIER are full barriers, so that they
    don't have to track in-progress discard operation when submitting new I/O.
    But currently we only treat them as elevator barriers, which don't
    actually do the nessecary queue drains.

where the discard barrier changes from REQ_SOFTBARRIER to REQ_HARDBARRIER.

If REQ_SOFTBARRIER means that the device is still free to reorder a
write, which was issued after discard completion was reported, before
the discard (so later discarding the data written), then certainly I
agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
unavoidable there; but if not, then it's not needed for the swap case.
 I hope to gain a little more enlightenment on such barriers shortly.

What does seem over the top to me, is for mm/swapfile.c's
blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
BLKDEV_IFL_BARRIER: those swap discards were originally written just
to use barriers, without needing to wait for completion in there.  I'd
be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
swap discards behave acceptably again for you - but understand that
you won't have a chance to try that until later next week.

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-06 22:07             ` Hugh Dickins
@ 2010-08-07 22:47               ` Nigel Cunningham
  2010-08-07 22:47               ` Nigel Cunningham
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-07 22:47 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Mark Lord, LKML, pm list, Christoph Hellwig

Hi.

On 07/08/10 08:07, Hugh Dickins wrote:
> On Thu, Aug 5, 2010 at 9:40 PM, Nigel Cunningham<nigel@tuxonice.net>  wrote:
>> On 06/08/10 11:15, Hugh Dickins wrote:
>>> On Wed, Aug 4, 2010 at 11:28 PM, Nigel Cunningham<nigel@tuxonice.net>
>>>   wrote:
>>>>>> On 05/08/10 13:58, Hugh Dickins wrote:
>>>>>
>>>>> I agree it would make more sense to discard swap when freeing rather
>>>>> than when allocating, I wish we could.  But at the freeing point we're
>>>>> often holding a page_table spinlock at an outer level, and it's just
>>>>> one page we're given to free.  Freeing is an operation you want to be
>>>>> comfortable doing when you're short of resources, whereas discard is a
>>>>> kind of I/O operation which needs resources.
>>
>> The more I think about this, the more it seems to me that doing the discard
>> at allocation time is just wrong. How about a strategy in which you do the
>> discard immediately if resources permit, or flag it as in need of doing (at
>> a future swap free of that page or swap off time) if things are too
>> pressured at the moment? I haven't put thought into what data structures
>> could be used for that - just want to ask for now if you'd be happy with the
>> idea of looking into a strategy like that.
>
> We're going round in circles here.  I have already agreed with you
> that doing it at swap free time seems more natural, but explained that
> there are implementation difficulties there, so doing it at allocation
> time proved both much less messy and more efficient.  I can imagine
> advances that would sway me to putting in more effort there
> (duplicating the scan for a free 1MB every time a page of swap is
> freed? doesn't sound efficient to me, but if it saves us from an
> inefficiency of too many or too late discards, perhaps it could work
> out).  However, a recently introduced regression does not make a
> strong case for adding in hacks - not yet anyway.

Okay; sorry.

>> I've done the bisect now - spent the time today instead of on the place, and
>
> That's great, many thanks!
>
>> it took me to fbbf055692aeb25c54c49d9ca84532de836fbba0. This is Christoph's
>> Hellwig's patch "block: fix DISCARD_BARRIER requests".
>
> That's
>      block: fix DISCARD_BARRIER requests
>
>      Filesystems assume that DISCARD_BARRIER are full barriers, so that they
>      don't have to track in-progress discard operation when submitting new I/O.
>      But currently we only treat them as elevator barriers, which don't
>      actually do the nessecary queue drains.
>
> where the discard barrier changes from REQ_SOFTBARRIER to REQ_HARDBARRIER.
>
> If REQ_SOFTBARRIER means that the device is still free to reorder a
> write, which was issued after discard completion was reported, before
> the discard (so later discarding the data written), then certainly I
> agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
> unavoidable there; but if not, then it's not needed for the swap case.
>   I hope to gain a little more enlightenment on such barriers shortly.
>
> What does seem over the top to me, is for mm/swapfile.c's
> blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
> BLKDEV_IFL_BARRIER: those swap discards were originally written just
> to use barriers, without needing to wait for completion in there.  I'd
> be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
> swap discards behave acceptably again for you - but understand that
> you won't have a chance to try that until later next week.

Well, I've just arrived at the hotel, and I want to delay going to bed 
until a 'proper' hour local time, so I expect I'll do it this evening - 
especially a one liner like that. Might even give it a go now...

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-06 22:07             ` Hugh Dickins
  2010-08-07 22:47               ` Nigel Cunningham
@ 2010-08-07 22:47               ` Nigel Cunningham
  2010-08-13 11:54               ` Christoph Hellwig
  2010-08-13 11:54               ` Christoph Hellwig
  3 siblings, 0 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-07 22:47 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Christoph Hellwig, Mark Lord, LKML, pm list

Hi.

On 07/08/10 08:07, Hugh Dickins wrote:
> On Thu, Aug 5, 2010 at 9:40 PM, Nigel Cunningham<nigel@tuxonice.net>  wrote:
>> On 06/08/10 11:15, Hugh Dickins wrote:
>>> On Wed, Aug 4, 2010 at 11:28 PM, Nigel Cunningham<nigel@tuxonice.net>
>>>   wrote:
>>>>>> On 05/08/10 13:58, Hugh Dickins wrote:
>>>>>
>>>>> I agree it would make more sense to discard swap when freeing rather
>>>>> than when allocating, I wish we could.  But at the freeing point we're
>>>>> often holding a page_table spinlock at an outer level, and it's just
>>>>> one page we're given to free.  Freeing is an operation you want to be
>>>>> comfortable doing when you're short of resources, whereas discard is a
>>>>> kind of I/O operation which needs resources.
>>
>> The more I think about this, the more it seems to me that doing the discard
>> at allocation time is just wrong. How about a strategy in which you do the
>> discard immediately if resources permit, or flag it as in need of doing (at
>> a future swap free of that page or swap off time) if things are too
>> pressured at the moment? I haven't put thought into what data structures
>> could be used for that - just want to ask for now if you'd be happy with the
>> idea of looking into a strategy like that.
>
> We're going round in circles here.  I have already agreed with you
> that doing it at swap free time seems more natural, but explained that
> there are implementation difficulties there, so doing it at allocation
> time proved both much less messy and more efficient.  I can imagine
> advances that would sway me to putting in more effort there
> (duplicating the scan for a free 1MB every time a page of swap is
> freed? doesn't sound efficient to me, but if it saves us from an
> inefficiency of too many or too late discards, perhaps it could work
> out).  However, a recently introduced regression does not make a
> strong case for adding in hacks - not yet anyway.

Okay; sorry.

>> I've done the bisect now - spent the time today instead of on the place, and
>
> That's great, many thanks!
>
>> it took me to fbbf055692aeb25c54c49d9ca84532de836fbba0. This is Christoph's
>> Hellwig's patch "block: fix DISCARD_BARRIER requests".
>
> That's
>      block: fix DISCARD_BARRIER requests
>
>      Filesystems assume that DISCARD_BARRIER are full barriers, so that they
>      don't have to track in-progress discard operation when submitting new I/O.
>      But currently we only treat them as elevator barriers, which don't
>      actually do the nessecary queue drains.
>
> where the discard barrier changes from REQ_SOFTBARRIER to REQ_HARDBARRIER.
>
> If REQ_SOFTBARRIER means that the device is still free to reorder a
> write, which was issued after discard completion was reported, before
> the discard (so later discarding the data written), then certainly I
> agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
> unavoidable there; but if not, then it's not needed for the swap case.
>   I hope to gain a little more enlightenment on such barriers shortly.
>
> What does seem over the top to me, is for mm/swapfile.c's
> blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
> BLKDEV_IFL_BARRIER: those swap discards were originally written just
> to use barriers, without needing to wait for completion in there.  I'd
> be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
> swap discards behave acceptably again for you - but understand that
> you won't have a chance to try that until later next week.

Well, I've just arrived at the hotel, and I want to delay going to bed 
until a 'proper' hour local time, so I expect I'll do it this evening - 
especially a one liner like that. Might even give it a go now...

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-06 22:07             ` Hugh Dickins
  2010-08-07 22:47               ` Nigel Cunningham
  2010-08-07 22:47               ` Nigel Cunningham
@ 2010-08-13 11:54               ` Christoph Hellwig
  2010-08-13 18:15                 ` Hugh Dickins
  2010-08-13 18:15                 ` Hugh Dickins
  2010-08-13 11:54               ` Christoph Hellwig
  3 siblings, 2 replies; 30+ messages in thread
From: Christoph Hellwig @ 2010-08-13 11:54 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nigel Cunningham, Mark Lord, LKML, pm list, Christoph Hellwig

On Fri, Aug 06, 2010 at 03:07:25PM -0700, Hugh Dickins wrote:
> If REQ_SOFTBARRIER means that the device is still free to reorder a
> write, which was issued after discard completion was reported, before
> the discard (so later discarding the data written), then certainly I
> agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
> unavoidable there; but if not, then it's not needed for the swap case.
>  I hope to gain a little more enlightenment on such barriers shortly.

REQ_SOFTBARRIER is indeed purely a reordering barrier inside the block
elevator.

> What does seem over the top to me, is for mm/swapfile.c's
> blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
> BLKDEV_IFL_BARRIER: those swap discards were originally written just
> to use barriers, without needing to wait for completion in there.  I'd
> be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
> swap discards behave acceptably again for you - but understand that
> you won't have a chance to try that until later next week.

That does indeed look incorrect to me.  Any kind of explicit waits
usually mean the caller provides ordering.  Getting rid of
BLKDEV_IFL_BARRIER in the swap code ASAP would indeed be beneficial
given that we are trying to get rid of hard barriers completely soon.
Auditing the existing blkdev_issue_discard callers in filesystems
is high on the todo list for me.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-13 11:54               ` Christoph Hellwig
@ 2010-08-13 18:15                 ` Hugh Dickins
  2010-08-14 11:43                   ` Christoph Hellwig
  2010-08-14 11:43                   ` Christoph Hellwig
  2010-08-13 18:15                 ` Hugh Dickins
  1 sibling, 2 replies; 30+ messages in thread
From: Hugh Dickins @ 2010-08-13 18:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Nigel Cunningham, Mark Lord, LKML, pm list, James Bottomley,
	Martin K. Petersen

On Fri, Aug 13, 2010 at 4:54 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 06, 2010 at 03:07:25PM -0700, Hugh Dickins wrote:
>> If REQ_SOFTBARRIER means that the device is still free to reorder a
>> write, which was issued after discard completion was reported, before
>> the discard (so later discarding the data written), then certainly I
>> agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
>> unavoidable there; but if not, then it's not needed for the swap case.
>>  I hope to gain a little more enlightenment on such barriers shortly.
>
> REQ_SOFTBARRIER is indeed purely a reordering barrier inside the block
> elevator.
>
>> What does seem over the top to me, is for mm/swapfile.c's
>> blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
>> BLKDEV_IFL_BARRIER: those swap discards were originally written just
>> to use barriers, without needing to wait for completion in there.  I'd
>> be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
>> swap discards behave acceptably again for you - but understand that
>> you won't have a chance to try that until later next week.
>
> That does indeed look incorrect to me.  Any kind of explicit waits
> usually mean the caller provides ordering.  Getting rid of
> BLKDEV_IFL_BARRIER in the swap code ASAP would indeed be beneficial
> given that we are trying to get rid of hard barriers completely soon.
> Auditing the existing blkdev_issue_discard callers in filesystems
> is high on the todo list for me.

Yes.

Above I was suggesting for Nigel to experiment with cutting out swap
discard's BLKDEV_IFL_WAITs - and the results of cutting those out but
leaving its BLKDEV_IFL_BARRIERs would still be interesting.  But after
digesting the LSF discussion and the email thread that led up to it, I
came to the same conclusion as you, that going forward we want to keep
its BLKDEV_IFL_WAITs (swapfile.c already provides all the other
synchronization for that to fit into - things like never freeing swap
while its still under writeback) and simply remove its
BLKDEV_IFL_BARRIERs.

However, I am still not quite sure that we can already make that
change for 2.6.35 (-stable).  Can you reassure me on the question I
raise above: if we issue a discard to a device with cache, wait for
"completion", then issue a write into the area spanned by that
discard, can we be certain that the write to backing store will not be
reordered before the discard of backing store (unless the device is
just broken)?  Without a  REQ_HARDBARRIER in the 2.6.35 scheme?  It
seems a very reasonable assumption to me, but I'm learning not to
depend upon reasonable assumptions here.  (By the way, it doesn't
matter at all whether writes not spanned by the discard pass it or
not.)

Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-13 18:15                 ` Hugh Dickins
@ 2010-08-14 11:43                   ` Christoph Hellwig
  2010-08-14 11:43                   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2010-08-14 11:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: James Bottomley, Martin K. Petersen, LKML, Christoph Hellwig,
	Mark Lord, pm list

On Fri, Aug 13, 2010 at 11:15:38AM -0700, Hugh Dickins wrote:
> However, I am still not quite sure that we can already make that
> change for 2.6.35 (-stable).  Can you reassure me on the question I
> raise above: if we issue a discard to a device with cache, wait for
> "completion", then issue a write into the area spanned by that
> discard, can we be certain that the write to backing store will not be
> reordered before the discard of backing store (unless the device is
> just broken)?  Without a  REQ_HARDBARRIER in the 2.6.35 scheme?  It
> seems a very reasonable assumption to me, but I'm learning not to
> depend upon reasonable assumptions here.  (By the way, it doesn't
> matter at all whether writes not spanned by the discard pass it or
> not.)

Neither the SCSI (SPC and SBC) make the cache part of the protocol
except for the commands to commit them to non-volatile storage, so
even when reordering the backing device write it must still not
reorder them vs notified completion.  That's nothing specific to
discard, e.g. when a write was notified as complete a new read must
come from the cache even if it hasn't been commited to the backing
device.  Now I can't guarantee that all cheap SSD firmware
implementations gets thus right for TRIM, but if one is really
that buggy we need to blacklist it.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-13 18:15                 ` Hugh Dickins
  2010-08-14 11:43                   ` Christoph Hellwig
@ 2010-08-14 11:43                   ` Christoph Hellwig
  1 sibling, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2010-08-14 11:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Christoph Hellwig, Nigel Cunningham, Mark Lord, LKML, pm list,
	James Bottomley, Martin K. Petersen

On Fri, Aug 13, 2010 at 11:15:38AM -0700, Hugh Dickins wrote:
> However, I am still not quite sure that we can already make that
> change for 2.6.35 (-stable).  Can you reassure me on the question I
> raise above: if we issue a discard to a device with cache, wait for
> "completion", then issue a write into the area spanned by that
> discard, can we be certain that the write to backing store will not be
> reordered before the discard of backing store (unless the device is
> just broken)?  Without a  REQ_HARDBARRIER in the 2.6.35 scheme?  It
> seems a very reasonable assumption to me, but I'm learning not to
> depend upon reasonable assumptions here.  (By the way, it doesn't
> matter at all whether writes not spanned by the discard pass it or
> not.)

Neither the SCSI (SPC and SBC) make the cache part of the protocol
except for the commands to commit them to non-volatile storage, so
even when reordering the backing device write it must still not
reorder them vs notified completion.  That's nothing specific to
discard, e.g. when a write was notified as complete a new read must
come from the cache even if it hasn't been commited to the backing
device.  Now I can't guarantee that all cheap SSD firmware
implementations gets thus right for TRIM, but if one is really
that buggy we need to blacklist it.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-13 11:54               ` Christoph Hellwig
  2010-08-13 18:15                 ` Hugh Dickins
@ 2010-08-13 18:15                 ` Hugh Dickins
  1 sibling, 0 replies; 30+ messages in thread
From: Hugh Dickins @ 2010-08-13 18:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Martin K. Petersen, LKML, James Bottomley, Mark Lord, pm list

On Fri, Aug 13, 2010 at 4:54 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Aug 06, 2010 at 03:07:25PM -0700, Hugh Dickins wrote:
>> If REQ_SOFTBARRIER means that the device is still free to reorder a
>> write, which was issued after discard completion was reported, before
>> the discard (so later discarding the data written), then certainly I
>> agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
>> unavoidable there; but if not, then it's not needed for the swap case.
>>  I hope to gain a little more enlightenment on such barriers shortly.
>
> REQ_SOFTBARRIER is indeed purely a reordering barrier inside the block
> elevator.
>
>> What does seem over the top to me, is for mm/swapfile.c's
>> blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
>> BLKDEV_IFL_BARRIER: those swap discards were originally written just
>> to use barriers, without needing to wait for completion in there.  I'd
>> be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
>> swap discards behave acceptably again for you - but understand that
>> you won't have a chance to try that until later next week.
>
> That does indeed look incorrect to me.  Any kind of explicit waits
> usually mean the caller provides ordering.  Getting rid of
> BLKDEV_IFL_BARRIER in the swap code ASAP would indeed be beneficial
> given that we are trying to get rid of hard barriers completely soon.
> Auditing the existing blkdev_issue_discard callers in filesystems
> is high on the todo list for me.

Yes.

Above I was suggesting for Nigel to experiment with cutting out swap
discard's BLKDEV_IFL_WAITs - and the results of cutting those out but
leaving its BLKDEV_IFL_BARRIERs would still be interesting.  But after
digesting the LSF discussion and the email thread that led up to it, I
came to the same conclusion as you, that going forward we want to keep
its BLKDEV_IFL_WAITs (swapfile.c already provides all the other
synchronization for that to fit into - things like never freeing swap
while its still under writeback) and simply remove its
BLKDEV_IFL_BARRIERs.

However, I am still not quite sure that we can already make that
change for 2.6.35 (-stable).  Can you reassure me on the question I
raise above: if we issue a discard to a device with cache, wait for
"completion", then issue a write into the area spanned by that
discard, can we be certain that the write to backing store will not be
reordered before the discard of backing store (unless the device is
just broken)?  Without a  REQ_HARDBARRIER in the 2.6.35 scheme?  It
seems a very reasonable assumption to me, but I'm learning not to
depend upon reasonable assumptions here.  (By the way, it doesn't
matter at all whether writes not spanned by the discard pass it or
not.)

Hugh
_______________________________________________
linux-pm mailing list
linux-pm@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/linux-pm

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-06 22:07             ` Hugh Dickins
                                 ` (2 preceding siblings ...)
  2010-08-13 11:54               ` Christoph Hellwig
@ 2010-08-13 11:54               ` Christoph Hellwig
  3 siblings, 0 replies; 30+ messages in thread
From: Christoph Hellwig @ 2010-08-13 11:54 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Christoph Hellwig, Mark Lord, LKML, pm list

On Fri, Aug 06, 2010 at 03:07:25PM -0700, Hugh Dickins wrote:
> If REQ_SOFTBARRIER means that the device is still free to reorder a
> write, which was issued after discard completion was reported, before
> the discard (so later discarding the data written), then certainly I
> agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
> unavoidable there; but if not, then it's not needed for the swap case.
>  I hope to gain a little more enlightenment on such barriers shortly.

REQ_SOFTBARRIER is indeed purely a reordering barrier inside the block
elevator.

> What does seem over the top to me, is for mm/swapfile.c's
> blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
> BLKDEV_IFL_BARRIER: those swap discards were originally written just
> to use barriers, without needing to wait for completion in there.  I'd
> be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
> swap discards behave acceptably again for you - but understand that
> you won't have a chance to try that until later next week.

That does indeed look incorrect to me.  Any kind of explicit waits
usually mean the caller provides ordering.  Getting rid of
BLKDEV_IFL_BARRIER in the swap code ASAP would indeed be beneficial
given that we are trying to get rid of hard barriers completely soon.
Auditing the existing blkdev_issue_discard callers in filesystems
is high on the todo list for me.

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-06  1:15         ` Hugh Dickins
  2010-08-06  4:40           ` Nigel Cunningham
@ 2010-08-06  4:40           ` Nigel Cunningham
  1 sibling, 0 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-06  4:40 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Mark Lord, LKML, pm list

Hi.

On 06/08/10 11:15, Hugh Dickins wrote:
> On Wed, Aug 4, 2010 at 11:28 PM, Nigel Cunningham<nigel@tuxonice.net>  wrote:
>> On 05/08/10 13:58, Hugh Dickins wrote:
>>> I agree it would make more sense to discard swap when freeing rather
>>> than when allocating, I wish we could.  But at the freeing point we're
>>> often holding a page_table spinlock at an outer level, and it's just
>>> one page we're given to free.  Freeing is an operation you want to be
>>> comfortable doing when you're short of resources, whereas discard is a
>>> kind of I/O operation which needs resources.

The more I think about this, the more it seems to me that doing the 
discard at allocation time is just wrong. How about a strategy in which 
you do the discard immediately if resources permit, or flag it as in 
need of doing (at a future swap free of that page or swap off time) if 
things are too pressured at the moment? I haven't put thought into what 
data structures could be used for that - just want to ask for now if 
you'd be happy with the idea of looking into a strategy like that.

>>> It happens that in the allocation path, there was already a place at
>>> which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
>>> when I say 1MB), so that was the neatest point at which to site the
>>> discard - though even there we have to be careful about racing
>>> allocations.
>>
>> Makes sense when you put it like that :)
>>
>> I know it's a bit messier, but would it be possible for us to modify the
>> behaviour depending on the reason for the allocation? (No page_table
>> spinlock holding when we're hibernating).
>
> But if we moved it to the swap free, it would occur in the swap free
> (if any) _prior_ to hibernating, when testing for hibernation would
> just say "no".

I see what you mean. I'm not so worried by that because if we're having 
to free swap in order to hibernate, things are going to be slow anyway 
and discards aren't going to stand out nearly so much.

I'm much more concerned by the fact that we're currently doing discards 
for swap that wasn't even being used. If I reboot, swapon and then 
hibernate without having touched swap so far, I still get the 'hit' on 
the whole 4GB. I don't think we should be storing on disk a "this hasn't 
been used since last discarded" flag, so I'm looking for other solutions.

>>> I did once try to go back and get it to work when freeing instead of
>>> allocating, gathering the swap slots up then freeing when convenient.
>>> It was messy, didn't work very well, and didn't show an improvement in
>>> performance (on what we were testing at the time).
>
> When I tried gathering together the frees, there were just too many
> short extents to make the discards worth doing that  way.

Okay.

>> For one or two at a time, I can see that would be the case. If it is
>> possible to do the discard of pages used for hibernation after we're
>> finished reading the image, that would be good. Even better would be to only
>> do the discard for pages that were actually used and just do a simple free
>> for ones that were only allocated.
>
> There are optimizations which could be done e.g. we discard the whole
> of swap at swapon time, then re-discard each 1MB as we begin to
> allocate from it.  Clearly that has a certain stupidity to it!  But
> the initial discard of the whole of swap should be efficient and worth
> doing.  We could keep track of which swap pages have already been
> discarded since last used, but that would take up another... it's not
> immediately clear to me whether it's another value or another bit of
> the swap count.
>
> We could provide an interface for hibernation, to do a minimal number
> of maximal range discards to suit hibernation (but we need to be very
> careful about ongoing allocation, as in another thread).
>
> But are these things worth doing?  I think not.  Discard is supposed
> to be helping not hindering: if the device only supports discard in a
> way that's so inefficient, we'd do better to blacklist it as not
> supporting discard at all.  My suspicion is that your SSD is of that
> kind: that it used not to be recognized as supporting discard, but now
> in 2.6.35 it is so recognized.  However, that's just a suspicion: let
> me not slander your SSD when it may be my code or someone else's to
> blame: needs testing.

I've done the bisect now - spent the time today instead of on the place, 
and it took me to fbbf055692aeb25c54c49d9ca84532de836fbba0. This is 
Christoph's Hellwig's patch "block: fix DISCARD_BARRIER requests".

Could the problem be that we're using DISCARD_FL_BARRIER too much? (I'm 
still looking at the code here, so am writing without having thought 
everything through _really_ carefully.

>> Of course I'm talking in ideals without having an intimate knowledge of the
>> swap allocation code or exactly how ugly the above would make it :)
>>
>>> I've not been able to test swap, with SSDs, for several months: that's
>>> a dreadful regression you've found, thanks a lot for reporting it:
>>> I'll be very interested to hear where you locate the cause.  If it
>>> needs changes to the way swap does discard, so be it.
>>
>> I'm traveling to the US on Saturday and have apparently been given one of
>> those nice seats with power, so I'll try and get the bisection done then.
>
> That would be helpful, but displays greater dedication than I'd offer myself!

Done without an airplane now. I don't have to be credited with too much 
dedication! :)

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-05  3:58     ` Hugh Dickins
  2010-08-05  6:28       ` Nigel Cunningham
@ 2010-08-05  6:28       ` Nigel Cunningham
  1 sibling, 0 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-05  6:28 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Mark Lord, LKML, pm list

Hi Hugh.

Thanks for the email.

On 05/08/10 13:58, Hugh Dickins wrote:
> On Wed, Aug 4, 2010 at 2:22 PM, Nigel Cunningham<nigel@tuxonice.net>  wrote:
>> On 04/08/10 22:44, Mark Lord wrote:
>>>
>>> Looks to me like more and more things are using the block discard
>>> functionality, and as predicted it is slowing things down enormously.
>>>
>>> The problem is that we still only discard tiny bits (a single range
>>> still??)
>>> per TRIM command, rather than batching larger ranges and larger numbers
>>> of ranges into single TRIM commands.
>>>
>>> That's a very poor implementation, especially when things start enabling
>>> it by default. Eg. the swap code, mke2fs, etc..
>>>
>>> Ugh.
>
> swap has been discarding since 2.6.29, on one 1MB range at a time.
> There's been no significant change at the swap end since then, but I
> guess more devices have been announcing themselves as nonrotational
> and supporting discard, and the implementation lower down has gone
> through a number of changes.

Okay; that's good to know.

>>
>> I was hoping for a nice quick and simple answer. Since I haven't got one,
>> I'll try to find time to do a git bisect. I think I'll also look at the swap
>> code more carefully and see if it's doing the sensible thing. I can't (at
>> the moment) see the logic behind calling discard when allocating swap. At
>> freeing time makes much more sense to me.
>
> I agree it would make more sense to discard swap when freeing rather
> than when allocating, I wish we could.  But at the freeing point we're
> often holding a page_table spinlock at an outer level, and it's just
> one page we're given to free.  Freeing is an operation you want to be
> comfortable doing when you're short of resources, whereas discard is a
> kind of I/O operation which needs resources.
>
> It happens that in the allocation path, there was already a place at
> which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
> when I say 1MB), so that was the neatest point at which to site the
> discard - though even there we have to be careful about racing
> allocations.

Makes sense when you put it like that :)

I know it's a bit messier, but would it be possible for us to modify the 
behaviour depending on the reason for the allocation? (No page_table 
spinlock holding when we're hibernating).

The issue isn't as noticable with [u]swsusp at the moment because 
they're allocating swap as the image is being written. If my current set 
of patches for Rafael get accepted, that will change (swap will be 
preallocated).

TuxOnIce always allocates all available storage since there's (usually) 
virtually zero cost of doing so and it then doesn't matter how much the 
drivers allocate when we do the atomic copy, or how good a compression 
ratio is achieved. That's what I'm aiming for in my patches for [u]swsusp.

> I did once try to go back and get it to work when freeing instead of
> allocating, gathering the swap slots up then freeing when convenient.
> It was messy, didn't work very well, and didn't show an improvement in
> performance (on what we were testing at the time).

For one or two at a time, I can see that would be the case. If it is 
possible to do the discard of pages used for hibernation after we're 
finished reading the image, that would be good. Even better would be to 
only do the discard for pages that were actually used and just do a 
simple free for ones that were only allocated.

Of course I'm talking in ideals without having an intimate knowledge of 
the swap allocation code or exactly how ugly the above would make it :)

> I've not been able to test swap, with SSDs, for several months: that's
> a dreadful regression you've found, thanks a lot for reporting it:
> I'll be very interested to hear where you locate the cause.  If it
> needs changes to the way swap does discard, so be it.

I'm traveling to the US on Saturday and have apparently been given one 
of those nice seats with power, so I'll try and get the bisection done then.

TTFN

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: 2.6.35 Regression: Ages spent discarding blocks that weren't used!
  2010-08-04 21:22   ` Nigel Cunningham
  2010-08-05  3:58     ` Hugh Dickins
@ 2010-08-05  3:58     ` Hugh Dickins
  1 sibling, 0 replies; 30+ messages in thread
From: Hugh Dickins @ 2010-08-05  3:58 UTC (permalink / raw)
  To: Nigel Cunningham; +Cc: Mark Lord, LKML, pm list

On Wed, Aug 4, 2010 at 2:22 PM, Nigel Cunningham <nigel@tuxonice.net> wrote:
> On 04/08/10 22:44, Mark Lord wrote:
>>
>> Looks to me like more and more things are using the block discard
>> functionality, and as predicted it is slowing things down enormously.
>>
>> The problem is that we still only discard tiny bits (a single range
>> still??)
>> per TRIM command, rather than batching larger ranges and larger numbers
>> of ranges into single TRIM commands.
>>
>> That's a very poor implementation, especially when things start enabling
>> it by default. Eg. the swap code, mke2fs, etc..
>>
>> Ugh.

swap has been discarding since 2.6.29, on one 1MB range at a time.
There's been no significant change at the swap end since then, but I
guess more devices have been announcing themselves as nonrotational
and supporting discard, and the implementation lower down has gone
through a number of changes.

>
> I was hoping for a nice quick and simple answer. Since I haven't got one,
> I'll try to find time to do a git bisect. I think I'll also look at the swap
> code more carefully and see if it's doing the sensible thing. I can't (at
> the moment) see the logic behind calling discard when allocating swap. At
> freeing time makes much more sense to me.

I agree it would make more sense to discard swap when freeing rather
than when allocating, I wish we could.  But at the freeing point we're
often holding a page_table spinlock at an outer level, and it's just
one page we're given to free.  Freeing is an operation you want to be
comfortable doing when you're short of resources, whereas discard is a
kind of I/O operation which needs resources.

It happens that in the allocation path, there was already a place at
which we scanned for a cluster of 1MB free (I'm thinking of 4kB pages
when I say 1MB), so that was the neatest point at which to site the
discard - though even there we have to be careful about racing
allocations.

I did once try to go back and get it to work when freeing instead of
allocating, gathering the swap slots up then freeing when convenient.
It was messy, didn't work very well, and didn't show an improvement in
performance (on what we were testing at the time).

I've not been able to test swap, with SSDs, for several months: that's
a dreadful regression you've found, thanks a lot for reporting it:
I'll be very interested to hear where you locate the cause.  If it
needs changes to the way swap does discard, so be it.

Hugh

^ permalink raw reply	[flat|nested] 30+ messages in thread

* 2.6.35 Regression: Ages spent discarding blocks that weren't used!
@ 2010-08-04  1:40 Nigel Cunningham
  0 siblings, 0 replies; 30+ messages in thread
From: Nigel Cunningham @ 2010-08-04  1:40 UTC (permalink / raw)
  To: LKML, pm list

Hi all.

I've just given hibernation a go under 2.6.35, and at first I thought 
there was some sort of hang in freezing processes. The computer sat 
there for aaaaaages, apparently doing nothing. Switched from TuxOnIce to 
swsusp to see if it was specific to my code but no - the problem was 
there too. I used the nifty new kdb support to get a backtrace, which was:

get_swap_page_of_type
discard_swap_cluster
blk_dev_issue_discard
wait_for_completion

Adding a printk in discard swap cluster gives the following:

[   46.758330] Discarding 256 pages from bdev 800003 beginning at page 
640377.
[   47.003363] Discarding 256 pages from bdev 800003 beginning at page 
640633.
[   47.246514] Discarding 256 pages from bdev 800003 beginning at page 
640889.

...

[  221.877465] Discarding 256 pages from bdev 800003 beginning at page 
826745.
[  222.121284] Discarding 256 pages from bdev 800003 beginning at page 
827001.
[  222.365908] Discarding 256 pages from bdev 800003 beginning at page 
827257.
[  222.610311] Discarding 256 pages from bdev 800003 beginning at page 
827513.

So allocating 4GB of swap on my SSD now takes 176 seconds instead of 
virtually no time at all. (This code is completely unchanged from 2.6.34).

I have a couple of questions:

1) As far as I can see, there haven't been any changes in mm/swapfile.c 
that would cause this slowdown, so something in the block layer has 
(from my point of view) regressed. Is this a known issue?

2) Why are we calling discard_swap_cluster anyway? The swap was unused 
and we're allocating it. I could understand calling it when freeing 
swap, but when allocating?

Regards,

Nigel

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2010-08-14 11:44 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-04  1:40 2.6.35 Regression: Ages spent discarding blocks that weren't used! Nigel Cunningham
2010-08-04  8:59 ` Stefan Richter
2010-08-04  9:16   ` Nigel Cunningham
2010-08-04  9:16   ` Nigel Cunningham
2010-08-04  8:59 ` Stefan Richter
2010-08-04 12:44 ` Mark Lord
2010-08-04 12:44 ` Mark Lord
2010-08-04 18:02   ` Martin K. Petersen
2010-08-04 18:02   ` Martin K. Petersen
2010-08-04 21:22   ` Nigel Cunningham
2010-08-04 21:22   ` Nigel Cunningham
2010-08-05  3:58     ` Hugh Dickins
2010-08-05  6:28       ` Nigel Cunningham
2010-08-06  1:15         ` Hugh Dickins
2010-08-06  1:15         ` Hugh Dickins
2010-08-06  4:40           ` Nigel Cunningham
2010-08-06 22:07             ` Hugh Dickins
2010-08-06 22:07             ` Hugh Dickins
2010-08-07 22:47               ` Nigel Cunningham
2010-08-07 22:47               ` Nigel Cunningham
2010-08-13 11:54               ` Christoph Hellwig
2010-08-13 18:15                 ` Hugh Dickins
2010-08-14 11:43                   ` Christoph Hellwig
2010-08-14 11:43                   ` Christoph Hellwig
2010-08-13 18:15                 ` Hugh Dickins
2010-08-13 11:54               ` Christoph Hellwig
2010-08-06  4:40           ` Nigel Cunningham
2010-08-05  6:28       ` Nigel Cunningham
2010-08-05  3:58     ` Hugh Dickins
2010-08-04  1:40 Nigel Cunningham

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.