* After memory pressure: can't read from tape anymore
@ 2010-11-28 19:15 Lukas Kolbe
  2010-11-29 17:09 ` Kai Makisara
  0 siblings, 1 reply; 38+ messages in thread
From: Lukas Kolbe @ 2010-11-28 19:15 UTC (permalink / raw)
  To: linux-scsi

Hi, 

On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
debian/squeeze), we see reproducible tape read and write failures after
the system was under memory pressure:

[342567.297152] st0: Can't allocate 2097152 byte tape buffer.
[342569.316099] st0: Can't allocate 2097152 byte tape buffer.
[342570.805164] st0: Can't allocate 2097152 byte tape buffer.
[342571.958331] st0: Can't allocate 2097152 byte tape buffer.
[342572.704264] st0: Can't allocate 2097152 byte tape buffer.
[342873.737130] st: from_buffer offset overflow.

Bacula is spewing this message every time it tries to access the tape
drive:
28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0). ERR=Input/output error

By memory pressure, I mean that the KVM processes containing the
postgres-db (~20 million files) and the bacula director have used all
available RAM, one of them using ~4 GiB of its 12 GiB swap for an hour or
so (by selecting a full restore, it seems that the whole directory tree
of the 15-million-file backup gets read into memory). After this, I wasn't
able to read from the second tape drive anymore (/dev/st0), whereas the
first tape drive was restoring data happily (it is currently about
halfway through a 3 TiB restore from 5 tapes).

This same behaviour appears when we're doing a few incremental backups;
after a while, it just isn't possible to use the tape drives anymore -
every I/O operation gives an I/O Error, even a simple dd bs=64k
count=10. After a restart, the system behaves correctly until
-seemingly- another memory pressure situation occurred.

I'd be delighted if somebody could help me debug this; my systemtap skills
are unfortunately non-existent.

Kind regards,
Lukas Kolbe



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-11-28 19:15 After memory pressure: can't read from tape anymore Lukas Kolbe
@ 2010-11-29 17:09 ` Kai Makisara
  2010-11-30 13:31   ` Lukas Kolbe
  2010-12-03 12:27   ` FUJITA Tomonori
  0 siblings, 2 replies; 38+ messages in thread
From: Kai Makisara @ 2010-11-29 17:09 UTC (permalink / raw)
  To: Lukas Kolbe; +Cc: linux-scsi

On Sun, 28 Nov 2010, Lukas Kolbe wrote:

> Hi, 
> 
> On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
> Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
> debian/squeeze), we see reproducible tape read and write failures after
> the system was under memory pressure:
> 
> [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
> [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
> [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
> [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
> [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
> [342873.737130] st: from_buffer offset overflow.
> 
> Bacula is spewing this message every time it tries to access the tape
> drive:
> 28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0). ERR=Input/output error
> 
> By memory pressure, I mean that the KVM processes containing the
> postgres-db (~20million files) and the bacula director have used all
> available RAM, one of them used ~4GiB of its 12GiB swap for an hour or
> so (by selecting a full restore, it seems that the whole directory tree
> of the 15mio files backup gets read into memory). After this, I wasn't
> able to read from the second tape drive anymore (/dev/st0); whereas the
> first tape drive was restoring the data happily (it is currently about
> halfway through a 3TiB restore from 5 tapes).
> 
> This same behaviour appears when we're doing a few incremental backups;
> after a while, it just isn't possible to use the tape drives anymore -
> every I/O operation gives an I/O Error, even a simple dd bs=64k
> count=10. After a restart, the system behaves correctly until
> -seemingly- another memory pressure situation occured.
> 
This is predictable. The maximum number of scatter/gather segments seems 
to be 128. The st driver first tries to set up the transfer directly from 
the user buffer to the HBA. The user buffer is usually fragmented so that 
one scatter/gather segment is used for each page. Assuming a 4 kB page 
size, the maximum size of the direct transfer is 128 x 4 = 512 kB.

When this fails, the driver tries to allocate a kernel buffer with 
physically contiguous segments larger than 4 kB. Let's assume that 
it can find 128 16 kB segments. In this case the maximum block size is 
2048 kB. Memory pressure results in memory fragmentation, so the driver 
can't find large enough segments and the allocation fails. This is what 
you are seeing.

So, one solution is to use 512 kB block size. Another one is to try to 
find out if the 128 segment limit is a physical limitation or just a 
choice. In the latter case the mptsas driver could be modified to support 
larger block size even after memory fragmentation.

Kai



* Re: After memory pressure: can't read from tape anymore
  2010-11-29 17:09 ` Kai Makisara
@ 2010-11-30 13:31   ` Lukas Kolbe
  2010-11-30 16:10     ` Boaz Harrosh
  2010-11-30 16:20     ` Kai Makisara
  2010-12-03 12:27   ` FUJITA Tomonori
  1 sibling, 2 replies; 38+ messages in thread
From: Lukas Kolbe @ 2010-11-30 13:31 UTC (permalink / raw)
  To: Kai Makisara; +Cc: linux-scsi

On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:

Hi,

> > On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
> > Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
> > debian/squeeze), we see reproducible tape read and write failures after
> > the system was under memory pressure:
> > 
> > [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
> > [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
> > [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
> > [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
> > [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
> > [342873.737130] st: from_buffer offset overflow.
> > 
> > Bacula is spewing this message every time it tries to access the tape
> > drive:
> > 28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0). ERR=Input/output error
> > 
> > By memory pressure, I mean that the KVM processes containing the
> > postgres-db (~20million files) and the bacula director have used all
> > available RAM, one of them used ~4GiB of its 12GiB swap for an hour or
> > so (by selecting a full restore, it seems that the whole directory tree
> > of the 15mio files backup gets read into memory). After this, I wasn't
> > able to read from the second tape drive anymore (/dev/st0); whereas the
> > first tape drive was restoring the data happily (it is currently about
> > halfway through a 3TiB restore from 5 tapes).
> > 
> > This same behaviour appears when we're doing a few incremental backups;
> > after a while, it just isn't possible to use the tape drives anymore -
> > every I/O operation gives an I/O Error, even a simple dd bs=64k
> > count=10. After a restart, the system behaves correctly until
> > -seemingly- another memory pressure situation occured.
> > 
> This is predictable. The maximum number of scatter/gather segments seems 
> to be 128. The st driver first tries to set up transfer directly from the 
> user buffer to the HBA. The user buffer is usually fragmented so that one 
> scatter/gather segment is used for each page. Assuming 4 kB page size, the 
> maximu size of the direct transfer is 128 x 4 = 512 kB.
> 
> When this fails, the driver tries to allocate a kernel buffer so that 
> there larger than 4 kB physically contiguous segments. Let's assume that 
> it can find 128 16 kB segments. In this case the maximum block size is 
> 2048 kB. Memory pressure results in memory fragmentation and the driver 
> can't find large enough segments and allocation fails. This is what you 
> are seeing.

Reasonable explanation, thanks. What makes me wonder is why it still
fails *after* the memory pressure is gone, i.e. free shows more than 4 GiB
of free memory. I had the output of /proc/meminfo at that time but can't
find it anymore :/

> So, one solution is to use 512 kB block size. Another one is to try to 
> find out if the 128 segment limit is a physical limitation or just a 
> choice. In the latter case the mptsas driver could be modified to support 
> larger block size even after memory fragmentation.

Even with a 64 kB block size (dd bs=64k), I was getting I/O errors trying
to access the tape drive. I am now trying to raise the max_sg_segs
parameter of the st module (modinfo says 256 is the default; I'm trying
1024 now) to see how well this works under memory pressure.

> Kai

-- 
Lukas




* Re: After memory pressure: can't read from tape anymore
  2010-11-30 13:31   ` Lukas Kolbe
@ 2010-11-30 16:10     ` Boaz Harrosh
  2010-11-30 16:23       ` Kai Makisara
  2010-11-30 16:20     ` Kai Makisara
  1 sibling, 1 reply; 38+ messages in thread
From: Boaz Harrosh @ 2010-11-30 16:10 UTC (permalink / raw)
  To: Lukas Kolbe; +Cc: Kai Makisara, linux-scsi

On 11/30/2010 03:31 PM, Lukas Kolbe wrote:
> On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
> 
> Hi,
> 
>>> On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
>>> Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
>>> debian/squeeze), we see reproducible tape read and write failures after
>>> the system was under memory pressure:
>>>
>>> [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
>>> [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
>>> [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
>>> [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
>>> [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
>>> [342873.737130] st: from_buffer offset overflow.
>>>
>>> Bacula is spewing this message every time it tries to access the tape
>>> drive:
>>> 28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0). ERR=Input/output error
>>>
>>> By memory pressure, I mean that the KVM processes containing the
>>> postgres-db (~20million files) and the bacula director have used all
>>> available RAM, one of them used ~4GiB of its 12GiB swap for an hour or
>>> so (by selecting a full restore, it seems that the whole directory tree
>>> of the 15mio files backup gets read into memory). After this, I wasn't
>>> able to read from the second tape drive anymore (/dev/st0); whereas the
>>> first tape drive was restoring the data happily (it is currently about
>>> halfway through a 3TiB restore from 5 tapes).
>>>
>>> This same behaviour appears when we're doing a few incremental backups;
>>> after a while, it just isn't possible to use the tape drives anymore -
>>> every I/O operation gives an I/O Error, even a simple dd bs=64k
>>> count=10. After a restart, the system behaves correctly until
>>> -seemingly- another memory pressure situation occured.
>>>
>> This is predictable. The maximum number of scatter/gather segments seems 
>> to be 128. The st driver first tries to set up transfer directly from the 
>> user buffer to the HBA. The user buffer is usually fragmented so that one 
>> scatter/gather segment is used for each page. Assuming 4 kB page size, the 
>> maximu size of the direct transfer is 128 x 4 = 512 kB.
>>
>> When this fails, the driver tries to allocate a kernel buffer so that 
>> there larger than 4 kB physically contiguous segments. Let's assume that 
>> it can find 128 16 kB segments. In this case the maximum block size is 
>> 2048 kB. Memory pressure results in memory fragmentation and the driver 
>> can't find large enough segments and allocation fails. This is what you 
>> are seeing.
> 
> Reasonable explanation, thanks. What makes me wonder is why it still
> fails *after* memory pressure was gone - ie free shows more than 4GiB of
> free memory. I had the output of /proc/meminfo at that time but can't
> find it anymore :/
> 
>> So, one solution is to use 512 kB block size. Another one is to try to 
>> find out if the 128 segment limit is a physical limitation or just a 
>> choice. In the latter case the mptsas driver could be modified to support 
>> larger block size even after memory fragmentation.
> 
> Even with 64kb blocksize (dd bs=64k), I was getting I/O errors trying to
> access the tape drive. I am now trying to upper the max_sg_segs
> parameter to the st module (modinfo says 256 is the default; I'm trying
> 1024 now) and see how well this works under memory pressure.
> 

It looks like something is broken/old-code in sr. Most important LLDs
and the block-layer/scsi-ml fully support sg-chaining, which can
effectively deliver limitless (only HW-limited) sg sizes. It looks like
sr has some code that tries to allocate contiguous buffers larger than
PAGE_SIZE. Why does it do that? It should not be necessary any more.

>> Kai
> 

Boaz


* Re: After memory pressure: can't read from tape anymore
  2010-11-30 13:31   ` Lukas Kolbe
  2010-11-30 16:10     ` Boaz Harrosh
@ 2010-11-30 16:20     ` Kai Makisara
  2010-12-01 17:06       ` Lukas Kolbe
  1 sibling, 1 reply; 38+ messages in thread
From: Kai Makisara @ 2010-11-30 16:20 UTC (permalink / raw)
  To: Lukas Kolbe; +Cc: linux-scsi

On Tue, 30 Nov 2010, Lukas Kolbe wrote:

> On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
> 
> Hi,
> 
> > > On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
> > > Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
> > > debian/squeeze), we see reproducible tape read and write failures after
> > > the system was under memory pressure:
> > > 
> > > [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
> > > [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
> > > [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
> > > [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
> > > [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
> > > [342873.737130] st: from_buffer offset overflow.
> > > 
...
> > When this fails, the driver tries to allocate a kernel buffer so that 
> > there larger than 4 kB physically contiguous segments. Let's assume that 
> > it can find 128 16 kB segments. In this case the maximum block size is 
> > 2048 kB. Memory pressure results in memory fragmentation and the driver 
> > can't find large enough segments and allocation fails. This is what you 
> > are seeing.
> 
> Reasonable explanation, thanks. What makes me wonder is why it still
> fails *after* memory pressure was gone - ie free shows more than 4GiB of
> free memory. I had the output of /proc/meminfo at that time but can't
> find it anymore :/
> 
This is because (AFAIK) the kernel does not defragment the memory. There 
may be contiguous free pages but the memory management data structures 
don't show these.

> > So, one solution is to use 512 kB block size. Another one is to try to 
> > find out if the 128 segment limit is a physical limitation or just a 
> > choice. In the latter case the mptsas driver could be modified to support 
> > larger block size even after memory fragmentation.
> 
> Even with 64kb blocksize (dd bs=64k), I was getting I/O errors trying to
> access the tape drive. I am now trying to upper the max_sg_segs
> parameter to the st module (modinfo says 256 is the default; I'm trying
> 1024 now) and see how well this works under memory pressure.
> 
This will not help. The final limit is the minimum of the limit of st and 
the limit of mptsas. The mptsas limit is 128. This is the limit that 
should be increased, but I don't know if it is possible.

If you see errors with a 64 kB block size, I would like to see any 
messages associated with these errors.

Kai


* Re: After memory pressure: can't read from tape anymore
  2010-11-30 16:10     ` Boaz Harrosh
@ 2010-11-30 16:23       ` Kai Makisara
  2010-11-30 16:44         ` Boaz Harrosh
  0 siblings, 1 reply; 38+ messages in thread
From: Kai Makisara @ 2010-11-30 16:23 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Lukas Kolbe, linux-scsi

On Tue, 30 Nov 2010, Boaz Harrosh wrote:

> On 11/30/2010 03:31 PM, Lukas Kolbe wrote:
> > On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
> > 
....
> It looks like something is broken/old-code in sr. Most important LLDs
> and block-layer scsi-ml fully support sg-chaining that effectively are
> able to deliver limitless (Only limited by HW) sg sizes. It looks like
> sr has some code that tries to allocate contiguous buffers larger than
> PAGE_SIZE. Why does it do that? It should not be necessary any more.
> 
The relevant driver is st and it uses sg chaining when necessary. I tried 
to explain that the effective limit in this case comes from mptsas. I 
don't know whether it is a HW limit or a driver limit.

Kai



* Re: After memory pressure: can't read from tape anymore
  2010-11-30 16:23       ` Kai Makisara
@ 2010-11-30 16:44         ` Boaz Harrosh
  2010-11-30 17:04           ` Kai Makisara
  0 siblings, 1 reply; 38+ messages in thread
From: Boaz Harrosh @ 2010-11-30 16:44 UTC (permalink / raw)
  To: Kai Makisara; +Cc: Lukas Kolbe, linux-scsi

On 11/30/2010 06:23 PM, Kai Makisara wrote:
> On Tue, 30 Nov 2010, Boaz Harrosh wrote:
> 
>> On 11/30/2010 03:31 PM, Lukas Kolbe wrote:
>>> On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
>>>
> ....
>> It looks like something is broken/old-code in sr. Most important LLDs
>> and block-layer scsi-ml fully support sg-chaining that effectively are
>> able to deliver limitless (Only limited by HW) sg sizes. It looks like
>> sr has some code that tries to allocate contiguous buffers larger than
>> PAGE_SIZE. Why does it do that? It should not be necessary any more.
>>
> The relevant driver is st 

Sorry I meant st, yes.

> and it use sg chaining when necessary. I tried 
> to explain that the effective limit in this case comes from mptsas. I 
> don't know if it is HW limit or driver limit.
> 

Then I don't understand where the failing allocation is. Where in the
code path is anyone trying to allocate something bigger than a page?
Please explain?

> Kai
> 

Thanks
Boaz


* Re: After memory pressure: can't read from tape anymore
  2010-11-30 16:44         ` Boaz Harrosh
@ 2010-11-30 17:04           ` Kai Makisara
  2010-11-30 17:24             ` Boaz Harrosh
  0 siblings, 1 reply; 38+ messages in thread
From: Kai Makisara @ 2010-11-30 17:04 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Lukas Kolbe, linux-scsi

On Tue, 30 Nov 2010, Boaz Harrosh wrote:

> On 11/30/2010 06:23 PM, Kai Makisara wrote:
> > On Tue, 30 Nov 2010, Boaz Harrosh wrote:
> > 
> >> On 11/30/2010 03:31 PM, Lukas Kolbe wrote:
> >>> On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
> >>>
> > ....
> >> It looks like something is broken/old-code in sr. Most important LLDs
> >> and block-layer scsi-ml fully support sg-chaining that effectively are
> >> able to deliver limitless (Only limited by HW) sg sizes. It looks like
> >> sr has some code that tries to allocate contiguous buffers larger than
> >> PAGE_SIZE. Why does it do that? It should not be necessary any more.
> >>
> > The relevant driver is st 
> 
> Sorry I meant st, yes.
> 
> > and it use sg chaining when necessary. I tried 
> > to explain that the effective limit in this case comes from mptsas. I 
> > don't know if it is HW limit or driver limit.
> > 
> 
> Than I don't understand where is the failing allocation. Where in the
> code path anyone is trying to allocate something bigger then a page?
> Please explain?
> 
The function enlarge_buffer() in st.c tries to allocate a driver buffer 
that is large enough for one block so that the number of contiguous memory 
blocks does not exceed the allowed maximum. Allocation is done using 
alloc_pages() (at line 3744 in st.c in 2.6.36), usually with order > 0.

Kai


* Re: After memory pressure: can't read from tape anymore
  2010-11-30 17:04           ` Kai Makisara
@ 2010-11-30 17:24             ` Boaz Harrosh
  2010-11-30 19:53               ` Kai Makisara
  2010-12-03  9:44               ` FUJITA Tomonori
  0 siblings, 2 replies; 38+ messages in thread
From: Boaz Harrosh @ 2010-11-30 17:24 UTC (permalink / raw)
  To: Kai Makisara; +Cc: Lukas Kolbe, linux-scsi

On 11/30/2010 07:04 PM, Kai Makisara wrote:
> On Tue, 30 Nov 2010, Boaz Harrosh wrote:
> 
>> On 11/30/2010 06:23 PM, Kai Makisara wrote:
>>> On Tue, 30 Nov 2010, Boaz Harrosh wrote:
>>>
>>>> On 11/30/2010 03:31 PM, Lukas Kolbe wrote:
>>>>> On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
>>>>>
>>> ....
>>>> It looks like something is broken/old-code in sr. Most important LLDs
>>>> and block-layer scsi-ml fully support sg-chaining that effectively are
>>>> able to deliver limitless (Only limited by HW) sg sizes. It looks like
>>>> sr has some code that tries to allocate contiguous buffers larger than
>>>> PAGE_SIZE. Why does it do that? It should not be necessary any more.
>>>>
>>> The relevant driver is st 
>>
>> Sorry I meant st, yes.
>>
>>> and it use sg chaining when necessary. I tried 
>>> to explain that the effective limit in this case comes from mptsas. I 
>>> don't know if it is HW limit or driver limit.
>>>
>>
>> Than I don't understand where is the failing allocation. Where in the
>> code path anyone is trying to allocate something bigger then a page?
>> Please explain?
>>
> The function enlarge_buffer() in st.c tries to allocate a driver buffer 
> that is large enough for one block so that the number of contiguous memory 
> blocks does not exceed the allowed maximum. Allocation is done using 
> alloc_pages() (at line 3744 in st.c in 2.6.36), usually with order > 0.
> 

I looked at enlarge_buffer() and it looks fragile and broken. If you really
need a pointer, e.g.:
	STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);

then why not use vmalloc() for buffers larger than PAGE_SIZE? Better yet,
avoid it by keeping a page array or sg-list and operating with aio-type
operations.

> Kai

But I understand this is a lot of work on an old driver. Perhaps pre-allocate
something big at startup, specified by the user?

Thanks
Boaz


* Re: After memory pressure: can't read from tape anymore
  2010-11-30 17:24             ` Boaz Harrosh
@ 2010-11-30 19:53               ` Kai Makisara
  2010-12-01  9:40                 ` Lukas Kolbe
  2010-12-02 10:01                 ` Lukas Kolbe
  2010-12-03  9:44               ` FUJITA Tomonori
  1 sibling, 2 replies; 38+ messages in thread
From: Kai Makisara @ 2010-11-30 19:53 UTC (permalink / raw)
  To: Boaz Harrosh; +Cc: Lukas Kolbe, linux-scsi

On Tue, 30 Nov 2010, Boaz Harrosh wrote:

...
> I looked at enlarge_buffer() and it looks fragile and broken. If you really
> need a pointer eg:
> 	STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);
> 
If you think it is broken, please fix it.

> Than way not use vmalloc() for buffers larger then PAGE_SIZE? But better yet
> avoid it by keeping a pages_array or sg-list and operate on an aio type
> operations.
> 
vmalloc() is not a solution here. Think about this from the HBA side. Each 
s/g segment must be contiguous in the address space the HBA uses. In many 
cases this is the physical memory address space. Any solution must make 
sure that the HBA can perform the requested data transfer.

> > Kai
> 
> But I understand this is a lot of work on an old driver. Perhaps pre-allocate
> something big at startup. specified by user?
> 
This used to be possible at some time and it could be made possible again. 
But I don't like this option because it means that the users must 
explicitly set the boot parameters.

And it is difficult for me to believe that modern SAS HBAs only support 
128 s/g segments.

Kai


* Re: After memory pressure: can't read from tape anymore
  2010-11-30 19:53               ` Kai Makisara
@ 2010-12-01  9:40                 ` Lukas Kolbe
  2010-12-02 11:17                   ` Desai, Kashyap
  2010-12-02 10:01                 ` Lukas Kolbe
  1 sibling, 1 reply; 38+ messages in thread
From: Lukas Kolbe @ 2010-12-01  9:40 UTC (permalink / raw)
  To: Kai Makisara; +Cc: Boaz Harrosh, linux-scsi, Kashyap Desai

Am Dienstag, den 30.11.2010, 21:53 +0200 schrieb Kai Makisara:
> On Tue, 30 Nov 2010, Boaz Harrosh wrote:

I'm Cc'ing Kashyap Desai from LSI; maybe he can comment on the hardware
limitations of the SAS1068E?

> ...
> > I looked at enlarge_buffer() and it looks fragile and broken. If you really
> > need a pointer eg:
> > 	STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);
> > 
> If you think it is broken, please fix it.
> 
> > Than way not use vmalloc() for buffers larger then PAGE_SIZE? But better yet
> > avoid it by keeping a pages_array or sg-list and operate on an aio type
> > operations.
> > 
> vmalloc() is not a solution here. Think about this from the HBA side. Each 
> s/g segment must be contiguous in the address space the HBA uses. In many 
> cases this is the physical memory address space. Any solution must make 
> sure that the HBA can perform the requested data transfer.
> 
> > > Kai
> > 
> > But I understand this is a lot of work on an old driver. Perhaps pre-allocate
> > something big at startup. specified by user?
> > 
> This used to be possible at some time and it could be made possible again. 
> But I don't like this option because it means that the users must 
> explicitly set the boot parameters.
> 
> And it is difficult for me to believe the modern SAS HBAs only support 128 
> s/g segments.
> 
> Kai


For reference, here's my original message with Kai's reply:

> Hi, 
> 
> On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
> Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
> debian/squeeze), we see reproducible tape read and write failures after
> the system was under memory pressure:
> 
> [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
> [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
> [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
> [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
> [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
> [342873.737130] st: from_buffer offset overflow.
> 
> Bacula is spewing this message every time it tries to access the tape
> drive:
> 28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0). ERR=Input/output error
> 
> By memory pressure, I mean that the KVM processes containing the
> postgres-db (~20million files) and the bacula director have used all
> available RAM, one of them used ~4GiB of its 12GiB swap for an hour or
> so (by selecting a full restore, it seems that the whole directory tree
> of the 15mio files backup gets read into memory). After this, I wasn't
> able to read from the second tape drive anymore (/dev/st0); whereas the
> first tape drive was restoring the data happily (it is currently about
> halfway through a 3TiB restore from 5 tapes).
> 
> This same behaviour appears when we're doing a few incremental backups;
> after a while, it just isn't possible to use the tape drives anymore -
> every I/O operation gives an I/O Error, even a simple dd bs=64k
> count=10. After a restart, the system behaves correctly until
> -seemingly- another memory pressure situation occured.
> 
This is predictable. The maximum number of scatter/gather segments seems 
to be 128. The st driver first tries to set up the transfer directly from 
the user buffer to the HBA. The user buffer is usually fragmented so that 
one scatter/gather segment is used for each page. Assuming a 4 kB page 
size, the maximum size of the direct transfer is 128 x 4 = 512 kB.

When this fails, the driver tries to allocate a kernel buffer with 
physically contiguous segments larger than 4 kB. Let's assume that 
it can find 128 16 kB segments. In this case the maximum block size is 
2048 kB. Memory pressure results in memory fragmentation, so the driver 
can't find large enough segments and the allocation fails. This is what 
you are seeing.

So, one solution is to use 512 kB block size. Another one is to try to 
find out if the 128 segment limit is a physical limitation or just a 
choice. In the latter case the mptsas driver could be modified to support 
larger block size even after memory fragmentation.

Kai







* Re: After memory pressure: can't read from tape anymore
  2010-11-30 16:20     ` Kai Makisara
@ 2010-12-01 17:06       ` Lukas Kolbe
  2010-12-02 16:41         ` Kai Makisara
  0 siblings, 1 reply; 38+ messages in thread
From: Lukas Kolbe @ 2010-12-01 17:06 UTC (permalink / raw)
  To: Kai Makisara; +Cc: linux-scsi, Kashyap Desai

Am Dienstag, den 30.11.2010, 18:20 +0200 schrieb Kai Makisara:

Hi, in reply to your earlier mail:

> On Tue, 30 Nov 2010, Lukas Kolbe wrote:
> 
> > On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
> > 
> > Hi,
> > 
> > > > On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
> > > > Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
> > > > debian/squeeze), we see reproducible tape read and write failures after
> > > > the system was under memory pressure:
> > > > 
> > > > [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
> > > > [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
> > > > [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
> > > > [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
> > > > [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
> > > > [342873.737130] st: from_buffer offset overflow.
> > > > 
> ...
> > > When this fails, the driver tries to allocate a kernel buffer so that 
> > > there larger than 4 kB physically contiguous segments. Let's assume that 
> > > it can find 128 16 kB segments. In this case the maximum block size is 
> > > 2048 kB. Memory pressure results in memory fragmentation and the driver 
> > > can't find large enough segments and allocation fails. This is what you 
> > > are seeing.
> > 
> > Reasonable explanation, thanks. What makes me wonder is why it still
> > fails *after* memory pressure was gone - ie free shows more than 4GiB of
> > free memory. I had the output of /proc/meminfo at that time but can't
> > find it anymore :/
> > 
> This is because (AFAIK) the kernel does not defragment the memory. There 
> may be contiguous free pages but the memory management data structures 
> don't show these.
> 
> > > So, one solution is to use 512 kB block size. Another one is to try to 
> > > find out if the 128 segment limit is a physical limitation or just a 
> > > choice. In the latter case the mptsas driver could be modified to support 
> > > larger block size even after memory fragmentation.
> > 
> > Even with 64kb blocksize (dd bs=64k), I was getting I/O errors trying to
> > access the tape drive. I am now trying to upper the max_sg_segs
> > parameter to the st module (modinfo says 256 is the default; I'm trying
> > 1024 now) and see how well this works under memory pressure.
> > 
> This will not help. The final limit is the minimum of the limit of st and 
> the limit of mtpsas. The mptsas limit is 128. This is the limit that 
> should be increased but I don't know if it is possible.
> 
> If you see error with 64 kB block size, I would like to see any messages 
> associated with these errors.

I have now hit this bug again. Trying to read and write a label from the
tape drive in question results in this (via bacula's btape command):

*readlabel
01-Dec 17:47 btape JobId 0: Error: block.c:1002 Read error on fd=3 at
file:blk 0:0 on device "drv1" (/dev/nst1). ERR=Value too large for
defined data type.
btape: btape.c:525 Volume has no label.

Volume Label:
Id                : **error**VerNo             : 0
VolName           :
PrevVolName       :
VolFile           : 0
LabelType         : Unknown 0
LabelSize         : 0
PoolName          :
MediaType         :
PoolType          :
HostName          :
Date label written: -4712-01-01 at 00:00
*label
Enter Volume Name: AAA543
01-Dec 17:47 btape JobId 0: Error: block.c:577 Write error at 0:0 on
device "drv1" (/dev/nst1). ERR=Input/output error.
01-Dec 17:48 btape JobId 0: Error: Backspace record at EOT failed.
ERR=Input/output error
Wrote Volume label for volume "AAA543".

dmesg says (as expected): 

[158529.011206] st1: Can't allocate 2097152 byte tape buffer.
[158544.348411] st: append_to_buffer offset overflow.
[158544.348416] st: append_to_buffer offset overflow.
[158544.348418] st: append_to_buffer offset overflow.
[158544.348419] st: append_to_buffer offset overflow.

Now a dd with 64 kB blocksize behaves really strangely:

root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=1
dd: reading `/dev/nst1': Device or resource busy
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.118717 s, 0.0 kB/s

ok, so some process must be using /dev/nst1, right?

root@shepherd:~# lsof  |grep st1

nope, nothing.

Subsequent dd's, only a few seconds later:

root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=1
0+0 records in
0+0 records out
0 bytes (0 B) copied, 4.64747 s, 0.0 kB/s
root@shepherd:~# echo $?
0

Ah right, we successfully read an EOF/EOT.

root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=1
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0041229 s, 0.0 kB/s

Possibly another EOT?

root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=1
dd: reading `/dev/nst1': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.0128587 s, 0.0 kB/s
root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=1
dd: reading `/dev/nst1': Input/output error
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000236144 s, 0.0 kB/s
root@shepherd:~# echo $?
1

Hm, now an I/O error! Now dmesg has this to tell me:
[158651.882012] st1: Can't allocate 5085561 byte tape buffer.

Trying to write to the tape looks like below, which seems to match your
earlier description; i.e. 64k/65k and 128k blocksizes work, while 256k
and above don't work anymore. At the moment I wasn't able to reproduce
the earlier failure to write with a 64k blocksize.

root@shepherd:~# lsof  |grep st1
root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=65k count=100
100+0 records in
100+0 records out
6656000 bytes (6.7 MB) copied, 2.08872 s, 3.2 MB/s
root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=64k count=100
100+0 records in
100+0 records out
6553600 bytes (6.6 MB) copied, 1.71815 s, 3.8 MB/s
root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=512k count=100
dd: writing `/dev/nst1': Input/output error
1+0 records in
0+0 records out
0 bytes (0 B) copied, 1.82643 s, 0.0 kB/s
root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=256k count=100
dd: writing `/dev/nst1': Input/output error
1+0 records in
0+0 records out
0 bytes (0 B) copied, 1.71959 s, 0.0 kB/s
root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=128k count=100
100+0 records in
100+0 records out
13107200 bytes (13 MB) copied, 2.08911 s, 6.3 MB/s
root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=64k count=100
100+0 records in
100+0 records out
6553600 bytes (6.6 MB) copied, 1.99401 s, 3.3 MB/s
root@shepherd:~# mt -f /dev/nst1 rewind
root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=100
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.00889507 s, 0.0 kB/s
root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=10^C
root@shepherd:~# lsof  |grep st1
root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=512k count=100
dd: reading `/dev/nst1': Device or resource busy
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000232968 s, 0.0 kB/s
root@shepherd:~# lsof  |grep st1
root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=128k count=100
0+100 records in
0+100 records out
6656000 bytes (6.7 MB) copied, 0.314093 s, 21.2 MB/s
root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=256k count=100
dd: reading `/dev/nst1': Device or resource busy
0+0 records in
0+0 records out
0 bytes (0 B) copied, 0.000367819 s, 0.0 kB/s
root@shepherd:~# lsof  |grep st1
root@shepherd:~# 


Swap is used mainly because of overcommitting RAM to two VMs, but that
memory is rarely accessed.


root@shepherd:~# cat /proc/meminfo 
MemTotal:        8197296 kB
MemFree:           72648 kB
Buffers:           40496 kB
Cached:          1891664 kB
SwapCached:      1131684 kB
Active:          4258136 kB
Inactive:        3452272 kB
Active(anon):    4010488 kB
Inactive(anon):  1767884 kB
Active(file):     247648 kB
Inactive(file):  1684388 kB
Unevictable:         160 kB
Mlocked:             160 kB
SwapTotal:       4194300 kB
SwapFree:        1398976 kB
Dirty:            336920 kB
Writeback:             0 kB
AnonPages:       4648596 kB
Mapped:             4472 kB
Shmem:                12 kB
Slab:             155140 kB
SReclaimable:     109152 kB
SUnreclaim:        45988 kB
KernelStack:        1448 kB
PageTables:        18436 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8227412 kB
Committed_AS:    7884284 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       59244 kB
VmallocChunk:   34359660812 kB
HardwareCorrupted:     0 kB
HugePages_Total:      64
HugePages_Free:       64
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        7488 kB
DirectMap2M:     8380416 kB

I hope this somehow helps,
Lukas



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-11-30 19:53               ` Kai Makisara
  2010-12-01  9:40                 ` Lukas Kolbe
@ 2010-12-02 10:01                 ` Lukas Kolbe
  1 sibling, 0 replies; 38+ messages in thread
From: Lukas Kolbe @ 2010-12-02 10:01 UTC (permalink / raw)
  To: Kai Makisara; +Cc: Boaz Harrosh, linux-scsi, Kashyap Desai

Am Dienstag, den 30.11.2010, 21:53 +0200 schrieb Kai Makisara:
> On Tue, 30 Nov 2010, Boaz Harrosh wrote:
> 
> ...
> > I looked at enlarge_buffer() and it looks fragile and broken. If you really
> > need a pointer eg:
> > 	STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);
> > 
> If you think it is broken, please fix it.
> 
> > Than way not use vmalloc() for buffers larger then PAGE_SIZE? But better yet
> > avoid it by keeping a pages_array or sg-list and operate on an aio type
> > operations.
> > 
> vmalloc() is not a solution here. Think about this from the HBA side. Each 
> s/g segment must be contiguous in the address space the HBA uses. In many 
> cases this is the physical memory address space. Any solution must make 
> sure that the HBA can perform the requested data transfer.
> 
> > > Kai
> > 
> > But I understand this is a lot of work on an old driver. Perhaps pre-allocate
> > something big at startup. specified by user?
> > 
> This used to be possible at some time and it could be made possible again. 
> But I don't like this option because it means that the users must 
> explicitly set the boot parameters.
> 
> And it is difficult for me to believe the modern SAS HBAs only support 128 
> s/g segments.

I'll go ahead and file a bug about this; maybe that gets it more
attention. At the moment, Linux 2.6.32-2.6.36 is pretty much useless for
tape-based backups for us.

What's interesting is that when this happens, one of the two tape drives
seems to keep working, while for the second one the system can't
allocate large enough buffers. 

> Kai

-- 
Lukas




^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: After memory pressure: can't read from tape anymore
  2010-12-01  9:40                 ` Lukas Kolbe
@ 2010-12-02 11:17                   ` Desai, Kashyap
  2010-12-02 16:22                     ` Kai Makisara
  0 siblings, 1 reply; 38+ messages in thread
From: Desai, Kashyap @ 2010-12-02 11:17 UTC (permalink / raw)
  To: Lukas Kolbe, Kai Makisara; +Cc: Boaz Harrosh, linux-scsi



> -----Original Message-----
> From: Lukas Kolbe [mailto:lkolbe@techfak.uni-bielefeld.de]
> Sent: Wednesday, December 01, 2010 3:10 PM
> To: Kai Makisara
> Cc: Boaz Harrosh; linux-scsi@vger.kernel.org; Desai, Kashyap
> Subject: Re: After memory pressure: can't read from tape anymore
> 
> Am Dienstag, den 30.11.2010, 21:53 +0200 schrieb Kai Makisara:
> > On Tue, 30 Nov 2010, Boaz Harrosh wrote:
> 
> I'm Cc'ing Kashyap Desai from LSI, maybe he can comment on the hardware
> limitations of the SAS1068E?
Lukas,

No, it is not a hardware limitation that CONFIG_FUSION_MAX_SGE needs to be 128.
But our code is written in such a way that even if you set it to more than 128, it will fall back to 128 again.

To change this value you need to make the change below in mptbase.h:

--
-#define MPT_SCSI_SG_DEPTH     CONFIG_FUSION_MAX_SGE
+#define MPT_SCSI_SG_DEPTH       256
--

128 is a good number of scatter/gather elements; it has long been the standard value for MPT Fusion.

This value is reflected in sg_tablesize, and the Linux scatter-gather code will use it when creating the sg_table for the HBA.
See: "cat /sys/class/scsi_host/host<x>/sg_tablesize"

If a single IO does not fit into sg_tablesize, it will be converted into multiple IOs for the low-level driver (by the Linux scatter-gather code).
So I do not see any problem with the CONFIG_FUSION_MAX_SGE value. Our driver internally converts the sglist into SGEs understood by the LSI hardware.

Thanks, Kashyap


> 
> > ...
> > > I looked at enlarge_buffer() and it looks fragile and broken. If
> you really
> > > need a pointer eg:
> > > 	STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);
> > >
> > If you think it is broken, please fix it.
> >
> > > Than way not use vmalloc() for buffers larger then PAGE_SIZE? But
> better yet
> > > avoid it by keeping a pages_array or sg-list and operate on an aio
> type
> > > operations.
> > >
> > vmalloc() is not a solution here. Think about this from the HBA side.
> Each
> > s/g segment must be contiguous in the address space the HBA uses. In
> many
> > cases this is the physical memory address space. Any solution must
> make
> > sure that the HBA can perform the requested data transfer.
> >
> > > > Kai
> > >
> > > But I understand this is a lot of work on an old driver. Perhaps
> pre-allocate
> > > something big at startup. specified by user?
> > >
> > This used to be possible at some time and it could be made possible
> again.
> > But I don't like this option because it means that the users must
> > explicitly set the boot parameters.
> >
> > And it is difficult for me to believe the modern SAS HBAs only
> support 128
> > s/g segments.
> >
> > Kai
> 
> 
> For reference, here's my original message with Kais reply:
> 
> > Hi,
> >
> > On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
> > Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
> > debian/squeeze), we see reproducible tape read and write failures
> after
> > the system was under memory pressure:
> >
> > [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
> > [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
> > [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
> > [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
> > [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
> > [342873.737130] st: from_buffer offset overflow.
> >
> > Bacula is spewing this message every time it tries to access the tape
> > drive:
> > 28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error
> on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0). ERR=Input/output
> error
> >
> > By memory pressure, I mean that the KVM processes containing the
> > postgres-db (~20million files) and the bacula director have used all
> > available RAM, one of them used ~4GiB of its 12GiB swap for an hour
> or
> > so (by selecting a full restore, it seems that the whole directory
> tree
> > of the 15mio files backup gets read into memory). After this, I
> wasn't
> > able to read from the second tape drive anymore (/dev/st0); whereas
> the
> > first tape drive was restoring the data happily (it is currently
> about
> > halfway through a 3TiB restore from 5 tapes).
> >
> > This same behaviour appears when we're doing a few incremental
> backups;
> > after a while, it just isn't possible to use the tape drives anymore
> -
> > every I/O operation gives an I/O Error, even a simple dd bs=64k
> > count=10. After a restart, the system behaves correctly until
> > -seemingly- another memory pressure situation occured.
> >
> This is predictable. The maximum number of scatter/gather segments
> seems
> to be 128. The st driver first tries to set up transfer directly from
> the
> user buffer to the HBA. The user buffer is usually fragmented so that
> one
> scatter/gather segment is used for each page. Assuming 4 kB page size,
> the
> maximum size of the direct transfer is 128 x 4 = 512 kB.
> 
> When this fails, the driver tries to allocate a kernel buffer with
> larger than 4 kB physically contiguous segments. Let's assume
> that
> it can find 128 16 kB segments. In this case the maximum block size is
> 2048 kB. Memory pressure results in memory fragmentation and the driver
> can't find large enough segments and allocation fails. This is what you
> are seeing.
> 
> So, one solution is to use 512 kB block size. Another one is to try to
> find out if the 128 segment limit is a physical limitation or just a
> choice. In the latter case the mptsas driver could be modified to
> support
> larger block size even after memory fragmentation.
> 
> Kai
> 
> 
> 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: After memory pressure: can't read from tape anymore
  2010-12-02 11:17                   ` Desai, Kashyap
@ 2010-12-02 16:22                     ` Kai Makisara
  2010-12-02 18:14                       ` Desai, Kashyap
  2010-12-03 10:13                       ` FUJITA Tomonori
  0 siblings, 2 replies; 38+ messages in thread
From: Kai Makisara @ 2010-12-02 16:22 UTC (permalink / raw)
  To: Desai, Kashyap; +Cc: Lukas Kolbe, Boaz Harrosh, linux-scsi

On Thu, 2 Dec 2010, Desai, Kashyap wrote:

> 
> 
> > -----Original Message-----
> > From: Lukas Kolbe [mailto:lkolbe@techfak.uni-bielefeld.de]
> > Sent: Wednesday, December 01, 2010 3:10 PM
> > To: Kai Makisara
> > Cc: Boaz Harrosh; linux-scsi@vger.kernel.org; Desai, Kashyap
> > Subject: Re: After memory pressure: can't read from tape anymore
> > 
> > Am Dienstag, den 30.11.2010, 21:53 +0200 schrieb Kai Makisara:
> > > On Tue, 30 Nov 2010, Boaz Harrosh wrote:
> > 
> > I'm Cc'ing Kashyap Desai from LSI, maybe he can comment on the hardware
> > limitations of the SAS1068E?
> Lukas,
> 
> No. it is not limitation from h/w that " CONFIG_FUSION_MAX_SGE" needs to be 128.
> But our code is written such a way that even if you change it more than 128, it will fall down to 128 again.
> 
> To change this value you need to do below changes in mptbase.h
> 
> --
> -#define MPT_SCSI_SG_DEPTH     CONFIG_FUSION_MAX_SGE
> +#define MPT_SCSI_SG_DEPTH       256
> --
> 
> 128 is good amount for Scatter gather element. This value is standard value for MPT FUSIION, since long.
> 
> This value will be reflect to sg_tablesize and linux scatter-gather module will use this value for creating sg_table for HBA.
> See: " cat /sys/class/scsi_host/host<x>/sg_tablesize"
> 
> If single IO is not able to fit into sg_tablesize, then it will be converted into multiple IOs for Low Layer Drivers(By "scatter-gather" module of linux).
> So I do not see any problem with 
> CONFIG_FUSION_MAX_SGE value.  Our driver internally convert sglist into SGE which understood by LSI H/W.
> 
You can't convert the write of one block into multiple IOs. If someone wants 
to write 2 MB blocks, the system must transfer 2 MB in one IO. The choices 
are:

1. Direct I/O from user space. With 4 kB page size, this means that 513 
s/g segments (512 if the user buffer is page-aligned) must be used. 
(Unless there is an IOMMU and an API for an ULD to use it.)

2. Use a bounce buffer. Dynamic allocation of the bounce buffer with a 
small number of segments is problematic when the system has been running 
for a long time but the buffer can be allocated when the device is 
detected. (This is problematic if the device is detected at run-time.)

I don't know which alternative is more efficient from the HW point of view 
if both are possible. Do you (or someone else) have any opinion on this?

Kai

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-12-01 17:06       ` Lukas Kolbe
@ 2010-12-02 16:41         ` Kai Makisara
  2010-12-06  7:59           ` Kai Makisara
  0 siblings, 1 reply; 38+ messages in thread
From: Kai Makisara @ 2010-12-02 16:41 UTC (permalink / raw)
  To: Lukas Kolbe; +Cc: linux-scsi, Kashyap Desai

On Wed, 1 Dec 2010, Lukas Kolbe wrote:

> Am Dienstag, den 30.11.2010, 18:20 +0200 schrieb Kai Makisara:
> 
...
> > If you see error with 64 kB block size, I would like to see any messages 
> > associated with these errors.
> 
> I have now hit this bug again. Trying to read and write a label from the
> tape drive in question results in this (via bacula's btape command):
> 
> *readlabel
> 01-Dec 17:47 btape JobId 0: Error: block.c:1002 Read error on fd=3 at
> file:blk 0:0 on device "drv1" (/dev/nst1). ERR=Value too large for
> defined data type.
> btape: btape.c:525 Volume has no label.
> 
> Volume Label:
> Id                : **error**VerNo             : 0
> VolName           :
> PrevVolName       :
> VolFile           : 0
> LabelType         : Unknown 0
> LabelSize         : 0
> PoolName          :
> MediaType         :
> PoolType          :
> HostName          :
> Date label written: -4712-01-01 at 00:00
> *label
> Enter Volume Name: AAA543
> 01-Dec 17:47 btape JobId 0: Error: block.c:577 Write error at 0:0 on
> device "drv1" (/dev/nst1). ERR=Input/output error.
> 01-Dec 17:48 btape JobId 0: Error: Backspace record at EOT failed.
> ERR=Input/output error
> Wrote Volume label for volume "AAA543".
> 
> dmesg says (as expected): 
> 
> [158529.011206] st1: Can't allocate 2097152 byte tape buffer.
> [158544.348411] st: append_to_buffer offset overflow.
> [158544.348416] st: append_to_buffer offset overflow.
> [158544.348418] st: append_to_buffer offset overflow.
> [158544.348419] st: append_to_buffer offset overflow.
> 
All the messages except the first one are something that should never
happen. I think that there is something wrong with the return from
enlarge_buffer() when it fails. I will look at this when I have time.

...
> Trying to write to the tape looks like below, which seems to match your
> earlier description; ie 64/65k works, 128k blocksize works, 256k
> blocksize and above don't work anymore. I wasn't able to reproduce not
> being able to write with a 64k blocksize at the moment.
> 
Assuming that the MPT driver is configured for 128 segments, you should 
always be able to read and write 512 kB blocks.

...
> I hope this somehow helps,
> Lukas
> 
These messages are helpful.

Thanks,
Kai


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: After memory pressure: can't read from tape anymore
  2010-12-02 16:22                     ` Kai Makisara
@ 2010-12-02 18:14                       ` Desai, Kashyap
  2010-12-02 20:25                         ` Kai Makisara
  2010-12-03 10:13                       ` FUJITA Tomonori
  1 sibling, 1 reply; 38+ messages in thread
From: Desai, Kashyap @ 2010-12-02 18:14 UTC (permalink / raw)
  To: Kai Makisara; +Cc: Lukas Kolbe, Boaz Harrosh, linux-scsi



> -----Original Message-----
> From: Kai Makisara [mailto:Kai.Makisara@kolumbus.fi]
> Sent: Thursday, December 02, 2010 9:53 PM
> To: Desai, Kashyap
> Cc: Lukas Kolbe; Boaz Harrosh; linux-scsi@vger.kernel.org
> Subject: RE: After memory pressure: can't read from tape anymore
> 
> On Thu, 2 Dec 2010, Desai, Kashyap wrote:
> 
> >
> >
> > > -----Original Message-----
> > > From: Lukas Kolbe [mailto:lkolbe@techfak.uni-bielefeld.de]
> > > Sent: Wednesday, December 01, 2010 3:10 PM
> > > To: Kai Makisara
> > > Cc: Boaz Harrosh; linux-scsi@vger.kernel.org; Desai, Kashyap
> > > Subject: Re: After memory pressure: can't read from tape anymore
> > >
> > > Am Dienstag, den 30.11.2010, 21:53 +0200 schrieb Kai Makisara:
> > > > On Tue, 30 Nov 2010, Boaz Harrosh wrote:
> > >
> > > I'm Cc'ing Kashyap Desai from LSI, maybe he can comment on the
> hardware
> > > limitations of the SAS1068E?
> > Lukas,
> >
> > No. it is not limitation from h/w that " CONFIG_FUSION_MAX_SGE" needs
> to be 128.
> > But our code is written such a way that even if you change it more
> than 128, it will fall down to 128 again.
> >
> > To change this value you need to do below changes in mptbase.h
> >
> > --
> > -#define MPT_SCSI_SG_DEPTH     CONFIG_FUSION_MAX_SGE
> > +#define MPT_SCSI_SG_DEPTH       256
> > --
> >
> > 128 is good amount for Scatter gather element. This value is standard
> value for MPT FUSIION, since long.
> >
> > This value will be reflect to sg_tablesize and linux scatter-gather
> module will use this value for creating sg_table for HBA.
> > See: " cat /sys/class/scsi_host/host<x>/sg_tablesize"
> >
> > If single IO is not able to fit into sg_tablesize, then it will be
> converted into multiple IOs for Low Layer Drivers(By "scatter-gather"
> module of linux).
> > So I do not see any problem with
> > CONFIG_FUSION_MAX_SGE value.  Our driver internally convert sglist
> into SGE which understood by LSI H/W.
> >
> You can't convert write of one block into multiple IOs. If someone
> wants
> to write 2 MB blocks, the system must transfer 2 MB in one IO. The
> choices
> are:

I am not sure why a single IO cannot be converted into multiple "IO" requests.
If you run the command below:
" sg_dd if=/dev/zero of=/dev/sdb bs=4800000 count=1"

you will see multiple IOs (requests) coming to the low-level driver.

> 
> 1. Direct I/O from user space. With 4 kB page size, this means that 513
> s/g segments (512 if the user buffer page is aligned) must be used.
> (Unless there is an IOMMU and an API for an ULD to use it.)
> 
> 2. Use a bounce buffer. Dynamic allocation of the bounce buffer with a
> small number of segments is problematic when the system has been
> running
> for a long time but the buffer can be allocated when the device is
> detected. (This is problematic if the device is detected at run-time.)
> 
> I don't know which alternative is more efficient from the HW point of
> view
> if both are possible. Do you (or someone else) have any opinion on
> this?

Regarding the above, I am also not sure why this has to be part of the low-level driver.
It has to be done by the block layer.... 
> 
> Kai

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: After memory pressure: can't read from tape anymore
  2010-12-02 18:14                       ` Desai, Kashyap
@ 2010-12-02 20:25                         ` Kai Makisara
  2010-12-05 10:44                           ` Lukas Kolbe
  0 siblings, 1 reply; 38+ messages in thread
From: Kai Makisara @ 2010-12-02 20:25 UTC (permalink / raw)
  To: Desai, Kashyap; +Cc: Lukas Kolbe, Boaz Harrosh, linux-scsi

On Thu, 2 Dec 2010, Desai, Kashyap wrote:

> 
> 
> > -----Original Message-----
> > From: Kai Makisara [mailto:Kai.Makisara@kolumbus.fi]
...
> > You can't convert write of one block into multiple IOs. If someone
> > wants
> > to write 2 MB blocks, the system must transfer 2 MB in one IO. The
> > choices
> > are:
> 
> I am not sure why single IO cannot be converted into multiple "IO" request.
> If you run below commands
> " sg_dd if=/dev/zero of=/dev/sdb bs=4800000 count=1"
> 
> You will see multiple IOs(requests) are coming to low layer driver. 
> 
Yes, but each one is writing one or more disk blocks of 512 bytes. You 
don't see writes of partial blocks. If the block size of the device is
2 MB, the minimum IO request is 2 MB. (The SCSI commands for sequential 
access devices are different from the commands for block devices, but the 
minimum read/write unit is one block in both cases.)

Kai

P.S. Why is it necessary to use 2 MB blocks? Some people say that it is 
the optimal block size for some current tape drives.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-11-30 17:24             ` Boaz Harrosh
  2010-11-30 19:53               ` Kai Makisara
@ 2010-12-03  9:44               ` FUJITA Tomonori
  1 sibling, 0 replies; 38+ messages in thread
From: FUJITA Tomonori @ 2010-12-03  9:44 UTC (permalink / raw)
  To: bharrosh; +Cc: Kai.Makisara, lkolbe, linux-scsi

On Tue, 30 Nov 2010 19:24:25 +0200
Boaz Harrosh <bharrosh@panasas.com> wrote:

> I looked at enlarge_buffer() and it looks fragile and broken. If you really
> need a pointer eg:
> 	STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);
> 
> Than way not use vmalloc() for buffers larger then PAGE_SIZE? But better yet

As Kai said, this buffer is used for DMA, so you can't use vmalloc.

The st driver keeps an array of pages. b_data is used for some commands
that do small data transfers (< PAGE_SIZE), so the driver uses the
first page for it.

> avoid it by keeping a pages_array or sg-list and operate on an aio type
> operations.

As explained, the driver keeps a pages_array.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: After memory pressure: can't read from tape anymore
  2010-12-02 16:22                     ` Kai Makisara
  2010-12-02 18:14                       ` Desai, Kashyap
@ 2010-12-03 10:13                       ` FUJITA Tomonori
  2010-12-03 10:45                         ` Desai, Kashyap
  1 sibling, 1 reply; 38+ messages in thread
From: FUJITA Tomonori @ 2010-12-03 10:13 UTC (permalink / raw)
  To: Kai.Makisara; +Cc: Kashyap.Desai, lkolbe, bharrosh, linux-scsi

On Thu, 2 Dec 2010 18:22:33 +0200 (EET)
Kai Makisara <Kai.Makisara@kolumbus.fi> wrote:

> > -#define MPT_SCSI_SG_DEPTH     CONFIG_FUSION_MAX_SGE
> > +#define MPT_SCSI_SG_DEPTH       256
> > --
> > 
> > 128 is good amount for Scatter gather element. This value is standard value for MPT FUSIION, since long.
> > 
> > This value will be reflect to sg_tablesize and linux scatter-gather module will use this value for creating sg_table for HBA.
> > See: " cat /sys/class/scsi_host/host<x>/sg_tablesize"
> > 
> > If single IO is not able to fit into sg_tablesize, then it will be converted into multiple IOs for Low Layer Drivers(By "scatter-gather" module of linux).
> > So I do not see any problem with 
> > CONFIG_FUSION_MAX_SGE value.  Our driver internally convert sglist into SGE which understood by LSI H/W.
> > 
> You can't convert write of one block into multiple IOs. If someone wants 
> to write 2 MB blocks, the system must transfer 2 MB in one IO. The choices 
> are:

I'm not sure that Kashyap is talking about multiple IO requests.

An SGE is the LSI hardware's data structure describing a DMA address
and a transfer length. The LSI hardware can chain SGEs, so if you pass
a large scatter-gather list to the driver, it leads to a single SCSI
command with multiple chained SGEs. I suppose mpt2sas can handle more
than 256 scatter/gather entries.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: After memory pressure: can't read from tape anymore
  2010-12-03 10:13                       ` FUJITA Tomonori
@ 2010-12-03 10:45                         ` Desai, Kashyap
  2010-12-03 11:11                           ` FUJITA Tomonori
  0 siblings, 1 reply; 38+ messages in thread
From: Desai, Kashyap @ 2010-12-03 10:45 UTC (permalink / raw)
  To: FUJITA Tomonori, Kai.Makisara; +Cc: lkolbe, bharrosh, linux-scsi



> -----Original Message-----
> From: FUJITA Tomonori [mailto:fujita.tomonori@lab.ntt.co.jp]
> Sent: Friday, December 03, 2010 3:43 PM
> To: Kai.Makisara@kolumbus.fi
> Cc: Desai, Kashyap; lkolbe@techfak.uni-bielefeld.de;
> bharrosh@panasas.com; linux-scsi@vger.kernel.org
> Subject: RE: After memory pressure: can't read from tape anymore
> 
> On Thu, 2 Dec 2010 18:22:33 +0200 (EET)
> Kai Makisara <Kai.Makisara@kolumbus.fi> wrote:
> 
> > > -#define MPT_SCSI_SG_DEPTH     CONFIG_FUSION_MAX_SGE
> > > +#define MPT_SCSI_SG_DEPTH       256
> > > --
> > >
> > > 128 is good amount for Scatter gather element. This value is
> standard value for MPT FUSIION, since long.
> > >
> > > This value will be reflect to sg_tablesize and linux scatter-gather
> module will use this value for creating sg_table for HBA.
> > > See: " cat /sys/class/scsi_host/host<x>/sg_tablesize"
> > >
> > > If single IO is not able to fit into sg_tablesize, then it will be
> converted into multiple IOs for Low Layer Drivers(By "scatter-gather"
> module of linux).
> > > So I do not see any problem with
> > > CONFIG_FUSION_MAX_SGE value.  Our driver internally convert sglist
> into SGE which understood by LSI H/W.
> > >
> > You can't convert write of one block into multiple IOs. If someone
> wants
> > to write 2 MB blocks, the system must transfer 2 MB in one IO. The
> choices
> > are:
> 
> I'm not sure that Kashyap is talking about multiple IO requests.
> 
> SGE is LSI H/W's data structure to describe the set of a dma address
> and a transfer length. LSI H/W can chain SGE so if you pass large
> scatter-gather to the driver, it leads to single SCSI command with
> changed multiple SGEs. I suppose mpt2sas can handle more than 256
> scatter gatters.

Let me add some more input.

I do not see that any parameter in the driver is playing a role in this issue.
As suggested earlier by Kai, the thinking is that changing CONFIG_FUSION_MAX_SGE will help.

But I see that the maximum value for CONFIG_FUSION_MAX_SGE is 128 and we cannot go beyond this.
I see the lines below in scsi_alloc_queue:

      blk_queue_max_hw_segments(q, shost->sg_tablesize);
      blk_queue_max_phys_segments(q, SCSI_MAX_SG_CHAIN_SEGMENTS);

Both values play a role in selecting the maximum number of scatter/gather elements.
max_hw_segments is equal to the value of CONFIG_FUSION_MAX_SGE, but max_phys_segments is limited to the value below:
#define SCSI_MAX_SG_SEGMENTS    128 (defined in scsi.h)

Even for mpt2sas, SCSI_MPT2SAS_MAX_SGE is 128.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: After memory pressure: can't read from tape anymore
  2010-12-03 10:45                         ` Desai, Kashyap
@ 2010-12-03 11:11                           ` FUJITA Tomonori
  0 siblings, 0 replies; 38+ messages in thread
From: FUJITA Tomonori @ 2010-12-03 11:11 UTC (permalink / raw)
  To: Kashyap.Desai; +Cc: fujita.tomonori, Kai.Makisara, lkolbe, bharrosh, linux-scsi

On Fri, 3 Dec 2010 16:15:34 +0530
"Desai, Kashyap" <Kashyap.Desai@lsi.com> wrote:

> But I see that the maximum value for CONFIG_FUSION_MAX_SGE is only 128, and we cannot go beyond this.
> I see in scsi_alloc_queue() the lines below:
> 
>       blk_queue_max_hw_segments(q, shost->sg_tablesize);
>       blk_queue_max_phys_segments(q, SCSI_MAX_SG_CHAIN_SEGMENTS);
> 
> Both values play a role in selecting the maximum number of scatter/gather elements.
> max_hw_segments is equal to the value of CONFIG_FUSION_MAX_SGE, but max_phys_segments is limited to the value below:
> #define SCSI_MAX_SG_SEGMENTS    128 (defined in scsi.h)

Hmm, max_phys_segments is set to SCSI_MAX_SG_CHAIN_SEGMENTS (not
SCSI_MAX_SG_SEGMENTS).

SCSI_MAX_SG_CHAIN_SEGMENTS can be larger:

#ifdef ARCH_HAS_SG_CHAIN
#define SCSI_MAX_SG_CHAIN_SEGMENTS	2048
#else
#define SCSI_MAX_SG_CHAIN_SEGMENTS	SCSI_MAX_SG_SEGMENTS
#endif

So shost->sg_tablesize (SCSI_MPT2SAS_MAX_SGE) sets the limit here.

btw, max_hw_segments and max_phys_segments were consolidated.

With newer kernels, you can find the following in
__scsi_alloc_queue():

blk_queue_max_segments(q, min_t(unsigned short, shost->sg_tablesize,
			SCSI_MAX_SG_CHAIN_SEGMENTS));

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-11-29 17:09 ` Kai Makisara
  2010-11-30 13:31   ` Lukas Kolbe
@ 2010-12-03 12:27   ` FUJITA Tomonori
  2010-12-03 14:59     ` Kai Mäkisara
  1 sibling, 1 reply; 38+ messages in thread
From: FUJITA Tomonori @ 2010-12-03 12:27 UTC (permalink / raw)
  To: Kai.Makisara; +Cc: lkolbe, linux-scsi

On Mon, 29 Nov 2010 19:09:46 +0200 (EET)
Kai Makisara <Kai.Makisara@kolumbus.fi> wrote:

> > This same behaviour appears when we're doing a few incremental backups;
> > after a while, it just isn't possible to use the tape drives anymore -
> > every I/O operation gives an I/O Error, even a simple dd bs=64k
> > count=10. After a restart, the system behaves correctly until
> > -seemingly- another memory pressure situation occurred.
> > 
> This is predictable. The maximum number of scatter/gather segments seems 
> to be 128. The st driver first tries to set up transfer directly from the 
> user buffer to the HBA. The user buffer is usually fragmented so that one 
> scatter/gather segment is used for each page. Assuming 4 kB page size, the 
> maximum size of the direct transfer is 128 x 4 = 512 kB.

Can we make enlarge_buffer a bit friendlier to the memory allocator?

His problem is that the driver can't allocate 2 MB with the hardware
limit of 128 segments.

enlarge_buffer tries to use ST_MAX_ORDER, and if that allocation (a 256 kB
page) fails, enlarge_buffer fails. Could it try a smaller order instead?

Not tested at all.


diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
index 5b7388f..119544b 100644
--- a/drivers/scsi/st.c
+++ b/drivers/scsi/st.c
@@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm
 		b_size = PAGE_SIZE << order;
 	} else {
 		for (b_size = PAGE_SIZE, order = 0;
-		     order < ST_MAX_ORDER && b_size < new_size;
+		     order < ST_MAX_ORDER &&
+			     max_segs * (PAGE_SIZE << order) < new_size;
 		     order++, b_size *= 2)
 			;  /* empty */
 	}

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-12-03 12:27   ` FUJITA Tomonori
@ 2010-12-03 14:59     ` Kai Mäkisara
  2010-12-03 15:06       ` James Bottomley
  0 siblings, 1 reply; 38+ messages in thread
From: Kai Mäkisara @ 2010-12-03 14:59 UTC (permalink / raw)
  To: FUJITA Tomonori; +Cc: lkolbe, linux-scsi

On 12/03/2010 02:27 PM, FUJITA Tomonori wrote:
> On Mon, 29 Nov 2010 19:09:46 +0200 (EET)
> Kai Makisara<Kai.Makisara@kolumbus.fi>  wrote:
>
>>> This same behaviour appears when we're doing a few incremental backups;
>>> after a while, it just isn't possible to use the tape drives anymore -
>>> every I/O operation gives an I/O Error, even a simple dd bs=64k
>>> count=10. After a restart, the system behaves correctly until
>>> -seemingly- another memory pressure situation occurred.
>>>
>> This is predictable. The maximum number of scatter/gather segments seems
>> to be 128. The st driver first tries to set up transfer directly from the
>> user buffer to the HBA. The user buffer is usually fragmented so that one
>> scatter/gather segment is used for each page. Assuming 4 kB page size, the
>> maximum size of the direct transfer is 128 x 4 = 512 kB.
>
> Can we make enlarge_buffer friendly to the memory allocator a bit?
>
> His problem is that the driver can't allocate 2 MB with the hardware
> limit of 128 segments.
>
> enlarge_buffer tries to use ST_MAX_ORDER and if the allocation (256 kB
> page) fails, enlarge_buffer fails. It could try smaller order instead?
>
> Not tested at all.
>
>
> diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> index 5b7388f..119544b 100644
> --- a/drivers/scsi/st.c
> +++ b/drivers/scsi/st.c
> @@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm
>   		b_size = PAGE_SIZE << order;
>   	} else {
>   		for (b_size = PAGE_SIZE, order = 0;
> -		     order < ST_MAX_ORDER && b_size < new_size;
> +		     order < ST_MAX_ORDER &&
> +			     max_segs * (PAGE_SIZE << order) < new_size;
>   		     order++, b_size *= 2)
>   			;  /* empty */
>   	}

You are correct. The loop does not work at all as it should. Years ago,
the strategy was to start with blocks as big as possible to minimize the 
number of s/g segments. Nowadays the segments must all be of the same size 
and the old logic is no longer applicable.

I have not tested the patch either, but it looks correct.

Thanks for noticing this bug. I hope this helps the users. The question 
about the number of s/g segments is still valid for the direct i/o case, 
but that is an optimization issue, not a matter of whether one can 
read/write at all.

Thanks,
Kai


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-12-03 14:59     ` Kai Mäkisara
@ 2010-12-03 15:06       ` James Bottomley
  2010-12-03 17:03         ` Lukas Kolbe
  0 siblings, 1 reply; 38+ messages in thread
From: James Bottomley @ 2010-12-03 15:06 UTC (permalink / raw)
  To: Kai Mäkisara; +Cc: FUJITA Tomonori, lkolbe, linux-scsi

On Fri, 2010-12-03 at 16:59 +0200, Kai Mäkisara wrote:
> On 12/03/2010 02:27 PM, FUJITA Tomonori wrote:
> > On Mon, 29 Nov 2010 19:09:46 +0200 (EET)
> > Kai Makisara<Kai.Makisara@kolumbus.fi>  wrote:
> >
> >>> This same behaviour appears when we're doing a few incremental backups;
> >>> after a while, it just isn't possible to use the tape drives anymore -
> >>> every I/O operation gives an I/O Error, even a simple dd bs=64k
> >>> count=10. After a restart, the system behaves correctly until
> >>> -seemingly- another memory pressure situation occurred.
> >>>
> >> This is predictable. The maximum number of scatter/gather segments seems
> >> to be 128. The st driver first tries to set up transfer directly from the
> >> user buffer to the HBA. The user buffer is usually fragmented so that one
> >> scatter/gather segment is used for each page. Assuming 4 kB page size, the
> >> maximum size of the direct transfer is 128 x 4 = 512 kB.
> >
> > Can we make enlarge_buffer friendly to the memory allocator a bit?
> >
> > His problem is that the driver can't allocate 2 MB with the hardware
> > limit of 128 segments.
> >
> > enlarge_buffer tries to use ST_MAX_ORDER and if the allocation (256 kB
> > page) fails, enlarge_buffer fails. It could try smaller order instead?
> >
> > Not tested at all.
> >
> >
> > diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> > index 5b7388f..119544b 100644
> > --- a/drivers/scsi/st.c
> > +++ b/drivers/scsi/st.c
> > @@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm
> >   		b_size = PAGE_SIZE << order;
> >   	} else {
> >   		for (b_size = PAGE_SIZE, order = 0;
> > -		     order < ST_MAX_ORDER && b_size < new_size;
> > +		     order < ST_MAX_ORDER &&
> > +			     max_segs * (PAGE_SIZE << order) < new_size;
> >   		     order++, b_size *= 2)
> >   			;  /* empty */
> >   	}
> 
> You are correct. The loop does not work at all as it should. Years ago,
> the strategy was to start with as big blocks as possible to minimize the 
> number s/g segments. Nowadays the segments must be of same size and the 
> old logic is not applicable.
> 
> I have not tested the patch either but it looks correct.
> 
> Thanks for noticing this bug. I hope this helps the users. The question 
> about number of s/g segments is still valid for the direct i/o case but 
> that is optimization and not whether one can read/write.

Realistically, though, this will only increase the probability of making
an allocation work, we can't get this to a certainty.

Since we fixed up the infrastructure to allow arbitrary length sg lists,
perhaps we should document what cards can actually take advantage of
this (and how to do so, since it's not set automatically on boot).  That
way users wanting tapes at least know what the problems are likely to be
and how to avoid them in their hardware purchasing decisions.  The
corollary is that we should likely have a list of not recommended cards:
if they can't go over 128 SG elements, then they're pretty much
unsuitable for modern tapes.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-12-03 15:06       ` James Bottomley
@ 2010-12-03 17:03         ` Lukas Kolbe
  2010-12-03 18:10           ` James Bottomley
  0 siblings, 1 reply; 38+ messages in thread
From: Lukas Kolbe @ 2010-12-03 17:03 UTC (permalink / raw)
  To: James Bottomley
  Cc: Kai Mäkisara, FUJITA Tomonori, linux-scsi, Kashyap Desai

Am Freitag, den 03.12.2010, 09:06 -0600 schrieb James Bottomley:
> On Fri, 2010-12-03 at 16:59 +0200, Kai Mäkisara wrote:
> > On 12/03/2010 02:27 PM, FUJITA Tomonori wrote:
> > >
> > > Can we make enlarge_buffer friendly to the memory allocator a bit?
> > >
> > > His problem is that the driver can't allocate 2 MB with the hardware
> > > limit of 128 segments.
> > >
> > > enlarge_buffer tries to use ST_MAX_ORDER and if the allocation (256 kB
> > > page) fails, enlarge_buffer fails. It could try smaller order instead?
> > >
> > > Not tested at all.
> > >
> > >
> > > diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> > > index 5b7388f..119544b 100644
> > > --- a/drivers/scsi/st.c
> > > +++ b/drivers/scsi/st.c
> > > @@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm
> > >   		b_size = PAGE_SIZE << order;
> > >   	} else {
> > >   		for (b_size = PAGE_SIZE, order = 0;
> > > -		     order < ST_MAX_ORDER && b_size < new_size;
> > > +		     order < ST_MAX_ORDER &&
> > > +			     max_segs * (PAGE_SIZE << order) < new_size;
> > >   		     order++, b_size *= 2)
> > >   			;  /* empty */
> > >   	}
> > 
> > You are correct. The loop does not work at all as it should. Years ago,
> > the strategy was to start with as big blocks as possible to minimize the 
> > number s/g segments. Nowadays the segments must be of same size and the 
> > old logic is not applicable.
> > 
> > I have not tested the patch either but it looks correct.
> > 
> > Thanks for noticing this bug. I hope this helps the users. The question 
> > about number of s/g segments is still valid for the direct i/o case but 
> > that is optimization and not whether one can read/write.
> 
> Realistically, though, this will only increase the probability of making
> an allocation work, we can't get this to a certainty.
> 
> Since we fixed up the infrastructure to allow arbitrary length sg lists,
> perhaps we should document what cards can actually take advantage of
> this (and how to do so, since it's not set automatically on boot).  That
> way users wanting tapes at least know what the problems are likely to be
> and how to avoid them in their hardware purchasing decisions.  The
> corollary is that we should likely have a list of not recommended cards:
> if they can't go over 128 SG elements, then they're pretty much
> unsuitable for modern tapes.

Are you implying here that the LSI SAS1068E is unsuitable for driving two
LTO-4 tape drives? Or is it 'just' a problem with the driver?

I'll test whether the above patch helps in our situation and report
back.

-- 
Lukas



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-12-03 17:03         ` Lukas Kolbe
@ 2010-12-03 18:10           ` James Bottomley
  2010-12-05 10:53             ` Lukas Kolbe
  2010-12-14 20:35             ` Vladislav Bolkhovitin
  0 siblings, 2 replies; 38+ messages in thread
From: James Bottomley @ 2010-12-03 18:10 UTC (permalink / raw)
  To: Lukas Kolbe; +Cc: Kai Mäkisara, FUJITA Tomonori, linux-scsi, Kashyap Desai

On Fri, 2010-12-03 at 18:03 +0100, Lukas Kolbe wrote:
> Am Freitag, den 03.12.2010, 09:06 -0600 schrieb James Bottomley:
> > On Fri, 2010-12-03 at 16:59 +0200, Kai Mäkisara wrote:
> > > On 12/03/2010 02:27 PM, FUJITA Tomonori wrote:
> > > >
> > > > Can we make enlarge_buffer friendly to the memory allocator a bit?
> > > >
> > > > His problem is that the driver can't allocate 2 MB with the hardware
> > > > limit of 128 segments.
> > > >
> > > > enlarge_buffer tries to use ST_MAX_ORDER and if the allocation (256 kB
> > > > page) fails, enlarge_buffer fails. It could try smaller order instead?
> > > >
> > > > Not tested at all.
> > > >
> > > >
> > > > diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> > > > index 5b7388f..119544b 100644
> > > > --- a/drivers/scsi/st.c
> > > > +++ b/drivers/scsi/st.c
> > > > @@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm
> > > >   		b_size = PAGE_SIZE << order;
> > > >   	} else {
> > > >   		for (b_size = PAGE_SIZE, order = 0;
> > > > -		     order < ST_MAX_ORDER && b_size < new_size;
> > > > +		     order < ST_MAX_ORDER &&
> > > > +			     max_segs * (PAGE_SIZE << order) < new_size;
> > > >   		     order++, b_size *= 2)
> > > >   			;  /* empty */
> > > >   	}
> > > 
> > > You are correct. The loop does not work at all as it should. Years ago,
> > > the strategy was to start with as big blocks as possible to minimize the 
> > > number s/g segments. Nowadays the segments must be of same size and the 
> > > old logic is not applicable.
> > > 
> > > I have not tested the patch either but it looks correct.
> > > 
> > > Thanks for noticing this bug. I hope this helps the users. The question 
> > > about number of s/g segments is still valid for the direct i/o case but 
> > > that is optimization and not whether one can read/write.
> > 
> > Realistically, though, this will only increase the probability of making
> > an allocation work, we can't get this to a certainty.
> > 
> > Since we fixed up the infrastructure to allow arbitrary length sg lists,
> > perhaps we should document what cards can actually take advantage of
> > this (and how to do so, since it's not set automatically on boot).  That
> > way users wanting tapes at least know what the problems are likely to be
> > and how to avoid them in their hardware purchasing decisions.  The
> > corollary is that we should likely have a list of not recommended cards:
> > if they can't go over 128 SG elements, then they're pretty much
> > unsuitable for modern tapes.
> 
> Are you implying here that the LSI SAS1068E is unsuitable to drive two
> LTO-4 tape drives? Or is it 'just' a problem with the driver?

The information seems to be the former.  There's no way the kernel can
guarantee physical contiguity of memory as it operates.  We try to
defrag, but it's probabilistic, not certain, so if we have to try to
find a physically contiguous buffer to copy into for an operation like
this, at some point that allocation is going to fail.

The only way to be certain you can get a 2MB block down to a tape device
is to be able to transmit the whole thing as a SG list of fully
discontiguous pages.  On a system with 4k pages, that requires 512 SG
entries.  From what I've heard Kashyap say, that can't currently be done
on the 1068 because of firmware limitations (I'm not entirely clear on
this, but that's how it sounds to me ... if there is a way of making
firmware accept more than 128 SG elements per SCSI command, then it is a
fairly simple driver change).  This isn't something we can work around
in the driver because the transaction can't be split ... it has to go
down as a single WRITE command with a single output data buffer.

The LSI 1068 is an upgradeable firmware system, so it's always possible
LSI can come up with a firmware update that increases the size (this
would also require a corresponding driver change), but it doesn't sound
to be something that can be done in the driver alone.

James



^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: After memory pressure: can't read from tape anymore
  2010-12-02 20:25                         ` Kai Makisara
@ 2010-12-05 10:44                           ` Lukas Kolbe
  0 siblings, 0 replies; 38+ messages in thread
From: Lukas Kolbe @ 2010-12-05 10:44 UTC (permalink / raw)
  To: Kai Makisara; +Cc: Desai, Kashyap, Boaz Harrosh, linux-scsi

Am Donnerstag, den 02.12.2010, 22:25 +0200 schrieb Kai Makisara:

> P.S. Why is it necessary to use 2 MB blocks? Some people say that it is 
> the optimal block size for some current tape drives.

I really couldn't care less about block sizes, but it seems that it's
impossible to reach high LTO4 write speeds with smaller block sizes;
tests show that >100 MiB/s is not achievable with block sizes of around
64 KB to 512 KB (from memory), so: the bigger the block size, the higher
the write speed.

I suppose this gets even more critical with LTO5 drives.

-- 
Lukas



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-12-03 18:10           ` James Bottomley
@ 2010-12-05 10:53             ` Lukas Kolbe
  2010-12-05 12:16               ` FUJITA Tomonori
  2010-12-14 20:35             ` Vladislav Bolkhovitin
  1 sibling, 1 reply; 38+ messages in thread
From: Lukas Kolbe @ 2010-12-05 10:53 UTC (permalink / raw)
  To: James Bottomley
  Cc: Kai Mäkisara, FUJITA Tomonori, linux-scsi, Kashyap Desai

Am Freitag, den 03.12.2010, 12:10 -0600 schrieb James Bottomley:
> On Fri, 2010-12-03 at 18:03 +0100, Lukas Kolbe wrote:
> > Am Freitag, den 03.12.2010, 09:06 -0600 schrieb James Bottomley:
> > > On Fri, 2010-12-03 at 16:59 +0200, Kai Mäkisara wrote:
> > > > On 12/03/2010 02:27 PM, FUJITA Tomonori wrote:
> > > > >
> > > > > Can we make enlarge_buffer friendly to the memory allocator a bit?
> > > > >
> > > > > His problem is that the driver can't allocate 2 MB with the hardware
> > > > > limit of 128 segments.
> > > > >
> > > > > enlarge_buffer tries to use ST_MAX_ORDER and if the allocation (256 kB
> > > > > page) fails, enlarge_buffer fails. It could try smaller order instead?
> > > > >
> > > > > Not tested at all.
> > > > >
> > > > >
> > > > > diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c
> > > > > index 5b7388f..119544b 100644
> > > > > --- a/drivers/scsi/st.c
> > > > > +++ b/drivers/scsi/st.c
> > > > > @@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm
> > > > >   		b_size = PAGE_SIZE << order;
> > > > >   	} else {
> > > > >   		for (b_size = PAGE_SIZE, order = 0;
> > > > > -		     order < ST_MAX_ORDER && b_size < new_size;
> > > > > +		     order < ST_MAX_ORDER &&
> > > > > +			     max_segs * (PAGE_SIZE << order) < new_size;
> > > > >   		     order++, b_size *= 2)
> > > > >   			;  /* empty */
> > > > >   	}
> > > > 
> > > > You are correct. The loop does not work at all as it should. Years ago,
> > > > the strategy was to start with as big blocks as possible to minimize the 
> > > > number s/g segments. Nowadays the segments must be of same size and the 
> > > > old logic is not applicable.
> > > > 
> > > > I have not tested the patch either but it looks correct.
> > > > 
> > > > Thanks for noticing this bug. I hope this helps the users. The question 
> > > > about number of s/g segments is still valid for the direct i/o case but 
> > > > that is optimization and not whether one can read/write.
> > > 
> > > Realistically, though, this will only increase the probability of making
> > > an allocation work, we can't get this to a certainty.
> > > 
> > > Since we fixed up the infrastructure to allow arbitrary length sg lists,
> > > perhaps we should document what cards can actually take advantage of
> > > this (and how to do so, since it's not set automatically on boot).  That
> > > way users wanting tapes at least know what the problems are likely to be
> > > and how to avoid them in their hardware purchasing decisions.  The
> > > corollary is that we should likely have a list of not recommended cards:
> > > if they can't go over 128 SG elements, then they're pretty much
> > > unsuitable for modern tapes.
> > 
> > Are you implying here that the LSI SAS1068E is unsuitable to drive two
> > LTO-4 tape drives? Or is it 'just' a problem with the driver?
> 
> The information seems to be the former.  There's no way the kernel can
> guarantee physical contiguity of memory as it operates.  We try to
> defrag, but it's probabilistic, not certain, so if we have to try to
> find a physically contiguous buffer to copy into for an operation like
> this, at some point that allocation is going to fail.
> 
> The only way to be certain you can get a 2MB block down to a tape device
> is to be able to transmit the whole thing as a SG list of fully
> discontiguous pages.  On a system with 4k pages, that requires 512 SG
> entries.  From what I've heard Kashyap say, that can't currently be done
> on the 1068 because of firmware limitations (I'm not entirely clear on
> this, but that's how it sounds to me ... if there is a way of making
> firmware accept more than 128 SG elements per SCSI command, then it is a
> fairly simple driver change).

Well, 2 MB block sizes actually do work - bacula reports a block size
of ~2 MB for each drive while writing to it. Only after there was memory
pressure and a new tape was inserted is it *not* possible anymore to
write to the tape with these block sizes, and dmesg shows one of these
every time bacula tries to read from or write to a tape:

[101883.958351] st0: Can't allocate 2097152 byte tape buffer.
[103901.666608] st0: Can't allocate 10249541 byte tape buffer.

No idea why it's trying 10 MB, though.

I tested with the patch from Fujita, and these messages from before
applying the patch: 

[158544.348411] st: append_to_buffer offset overflow.

do not appear anymore.
It didn't help with the not-being-able-to-write-after-memory-pressure
matter, though.

>  This isn't something we can work around
> in the driver because the transaction can't be split ... it has to go
> down as a single WRITE command with a single output data buffer.
> 
> The LSI 1068 is an upgradeable firmware system, so it's always possible
> LSI can come up with a firmware update that increases the size (this
> would also require a corresponding driver change), but it doesn't sound
> to be something that can be done in the driver alone.

If only LSI's website were a little clearer on where to find updated
firmware and what the latest version is :/.

-- 
Lukas



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-12-05 10:53             ` Lukas Kolbe
@ 2010-12-05 12:16               ` FUJITA Tomonori
  0 siblings, 0 replies; 38+ messages in thread
From: FUJITA Tomonori @ 2010-12-05 12:16 UTC (permalink / raw)
  To: lkolbe
  Cc: James.Bottomley, kai.makisara, fujita.tomonori, linux-scsi,
	Kashyap.Desai

On Sun, 05 Dec 2010 11:53:03 +0100
Lukas Kolbe <lkolbe@techfak.uni-bielefeld.de> wrote:

> Well, 2MB blocksizes actually do work - bacula is reporting a blocksize
> of ~2MB for each drive while writing to it - only after there was memory
> pressure and a new tape got inserted, it is *not* possible anymore to
> write to the tape with these blocksizes, and dmesg tells me one of these
> every time bacula tries to read from or write to a tape:

I don't know how bacula works, but I guess that it closes and reopens an
st device when you insert a new tape. The driver frees the memory when
it closes the device.


> [101883.958351] st0: Can't allocate 2097152 byte tape buffer.
> [103901.666608] st0: Can't allocate 10249541 byte tape buffer.
> 
> No idea why it's trying 10MB, though.
> 
> I tested with the patch from Fujita, and this messages from before
> applying the patch: 
> 
> [158544.348411] st: append_to_buffer offset overflow.
> 
> do not appear anymore.
> It didn't help on the not-being-able-to-write-after-memory-pressure
> matter, though.

There is no way to guarantee that we can allocate a large physically
contiguous chunk of memory.


> >  This isn't something we can work around
> > in the driver because the transaction can't be split ... it has to go
> > down as a single WRITE command with a single output data buffer.
> > 
> > The LSI 1068 is an upgradeable firmware system, so it's always possible
> > LSI can come up with a firmware update that increases the size (this
> > would also require a corresponding driver change), but it doesn't sound
> > to be something that can be done in the driver alone.
> 
> If only LSI's website were a little more clear on where to find updated
> firmware and what was the latest version :/.

I think Desai said that your hardware can handle more sg
entries, so you have a better chance of coping with memory pressure. Try
the following patch on top of my patch. You need to set FUSION_MAX_SGE to
256 with kernel menuconfig (or whatever you like) and rebuild your
kernel. Make sure that the driver supports 256 entries like this:

fujita@calla:~$ cat /sys/class/scsi_host/host4/sg_tablesize
256

diff --git a/drivers/message/fusion/Kconfig b/drivers/message/fusion/Kconfig
index a34a11d..e70d65e 100644
--- a/drivers/message/fusion/Kconfig
+++ b/drivers/message/fusion/Kconfig
@@ -61,9 +61,9 @@ config FUSION_SAS
 	  LSISAS1078
 
 config FUSION_MAX_SGE
-	int "Maximum number of scatter gather entries (16 - 128)"
+	int "Maximum number of scatter gather entries (16 - 256)"
 	default "128"
-	range 16 128
+	range 16 256
 	help
 	  This option allows you to specify the maximum number of scatter-
 	  gather entries per I/O. The driver default is 128, which matches

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-12-02 16:41         ` Kai Makisara
@ 2010-12-06  7:59           ` Kai Makisara
  2010-12-06  8:50             ` FUJITA Tomonori
  2010-12-06  9:36             ` Lukas Kolbe
  0 siblings, 2 replies; 38+ messages in thread
From: Kai Makisara @ 2010-12-06  7:59 UTC (permalink / raw)
  To: Lukas Kolbe; +Cc: linux-scsi, FUJITA Tomonori

On Thu, 2 Dec 2010, Kai Makisara wrote:

> On Wed, 1 Dec 2010, Lukas Kolbe wrote:
> 
> > Am Dienstag, den 30.11.2010, 18:20 +0200 schrieb Kai Makisara:
> > 
> ...
> > > If you see error with 64 kB block size, I would like to see any messages 
> > > associated with these errors.
> > 
> > I have now hit this bug again. Trying to read and write a label from the
> > tape drive in question results in this (via bacula's btape command):
> > 
...
> > [158529.011206] st1: Can't allocate 2097152 byte tape buffer.
> > [158544.348411] st: append_to_buffer offset overflow.
> > [158544.348416] st: append_to_buffer offset overflow.
> > [158544.348418] st: append_to_buffer offset overflow.
> > [158544.348419] st: append_to_buffer offset overflow.
> > 
> The messages except the first one are something that should never happen. 
> I think that there is something wrong with returning from enlarge_buffer() 
> when it fails. I will look at this when I have time.
> 
OK, today I have had some time (national holiday). I think I have tracked 
down this problem. The patch at the end should fix it. Basically, 
normalize_buffer() needs to know the order of the pages in order to 
properly free the pages and update the buffer size. When allocation 
failed, the order had not yet been stored in the tape buffer definition. 
This explains the problem after a failed allocation. Why the allocation 
failed is another problem, which has been discussed elsewhere.

Kai

-----------------------------------8<------------------------------------
--- linux-2.6.36.1/drivers/scsi/st.c.org	2010-12-05 17:07:04.285226110 +0200
+++ linux-2.6.36.1/drivers/scsi/st.c	2010-12-06 09:46:34.756000154 +0200
@@ -3729,6 +3729,7 @@ static int enlarge_buffer(struct st_buff
 		     order < ST_MAX_ORDER && b_size < new_size;
 		     order++, b_size *= 2)
 			;  /* empty */
+		STbuffer->reserved_page_order = order;
 	}
 	if (max_segs * (PAGE_SIZE << order) < new_size) {
 		if (order == ST_MAX_ORDER)
@@ -3755,7 +3756,6 @@ static int enlarge_buffer(struct st_buff
 		segs++;
 	}
 	STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);
-	STbuffer->reserved_page_order = order;
 
 	return 1;
 }

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-12-06  7:59           ` Kai Makisara
@ 2010-12-06  8:50             ` FUJITA Tomonori
  2010-12-06  9:36             ` Lukas Kolbe
  1 sibling, 0 replies; 38+ messages in thread
From: FUJITA Tomonori @ 2010-12-06  8:50 UTC (permalink / raw)
  To: Kai.Makisara; +Cc: lkolbe, linux-scsi, fujita.tomonori

On Mon, 6 Dec 2010 09:59:11 +0200 (EET)
Kai Makisara <Kai.Makisara@kolumbus.fi> wrote:

> On Thu, 2 Dec 2010, Kai Makisara wrote:
> 
> > On Wed, 1 Dec 2010, Lukas Kolbe wrote:
> > 
> > > Am Dienstag, den 30.11.2010, 18:20 +0200 schrieb Kai Makisara:
> > > 
> > ...
> > > > If you see error with 64 kB block size, I would like to see any messages 
> > > > associated with these errors.
> > > 
> > > I have now hit this bug again. Trying to read and write a label from the
> > > tape drive in question results in this (via bacula's btape command):
> > > 
> ...
> > > [158529.011206] st1: Can't allocate 2097152 byte tape buffer.
> > > [158544.348411] st: append_to_buffer offset overflow.
> > > [158544.348416] st: append_to_buffer offset overflow.
> > > [158544.348418] st: append_to_buffer offset overflow.
> > > [158544.348419] st: append_to_buffer offset overflow.
> > > 
> > The messages except the first one are something that should never happen. 
> > I think that there is something wrong with returning from enlarge_buffer() 
> > when it fails. I will look at this when I have time.
> > 
> OK, today I have had some time (national holiday). I think I have tracked 
> down this problem. The patch at the end should fix it. Basically, 
> normalize_buffer() needs to know the order of the pages in order to 
> properly free the pages and update the buffer size. When allocation 
> failed, the order was not yet stored into the tape buffer definition. This 
> does explain the problem after allocation failed.

Ah, nice catch! I've not tested the patch, but it looks correct.


> Why allocation failed is 
> another problem which has been discussed elsewhere.

Yeah.

Thanks a lot!

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: After memory pressure: can't read from tape anymore
  2010-12-06  7:59           ` Kai Makisara
  2010-12-06  8:50             ` FUJITA Tomonori
@ 2010-12-06  9:36             ` Lukas Kolbe
  2010-12-06 11:34               ` Bjørn Mork
  2010-12-08 14:19               ` Lukas Kolbe
  1 sibling, 2 replies; 38+ messages in thread
From: Lukas Kolbe @ 2010-12-06  9:36 UTC (permalink / raw)
  To: Kai Makisara; +Cc: linux-scsi, FUJITA Tomonori

On Mon, 2010-12-06 at 09:59 +0200, Kai Makisara wrote:
> On Thu, 2 Dec 2010, Kai Makisara wrote:

> OK, today I have had some time (a national holiday). I think I have tracked 
> down this problem. The patch at the end should fix it. Basically, 
> normalize_buffer() needs to know the order of the pages in order to 
> properly free them and update the buffer size. When allocation 
> failed, the order had not yet been stored in the tape buffer definition. 
> This explains the problem seen after a failed allocation.

Thank you, this does indeed look nasty (not an easily caught bug).

> Why allocation failed is another problem which has been discussed elsewhere.

I just discovered tonight that Debian's .config has
CONFIG_FUSION_MAX_SGE=40. Right now I'm testing with the mptsas driver's
maximum of 128 to see whether I can reproduce the failure under memory
pressure again. Sorry for not catching this one earlier! After reading
the source, I had assumed that it already was 128 ...

-- 
Lukas




* Re: After memory pressure: can't read from tape anymore
  2010-12-06  9:36             ` Lukas Kolbe
@ 2010-12-06 11:34               ` Bjørn Mork
  2010-12-08 14:19               ` Lukas Kolbe
  1 sibling, 0 replies; 38+ messages in thread
From: Bjørn Mork @ 2010-12-06 11:34 UTC (permalink / raw)
  To: linux-scsi

Lukas Kolbe <lkolbe@techfak.uni-bielefeld.de> writes:

> I just discovered tonight that Debian's .config has
> CONFIG_FUSION_MAX_SGE=40. Right now I'm testing with the mptsas driver's
> maximum of 128 to see whether I can reproduce the failure under memory
> pressure again. Sorry for not catching this one earlier! After reading
> the source, I had assumed that it already was 128 ...

So did I...  

I assume the difference is unintentional so I opened a Debian bug:
http://bugs.debian.org/606096


Bjørn



* Re: After memory pressure: can't read from tape anymore
  2010-12-06  9:36             ` Lukas Kolbe
  2010-12-06 11:34               ` Bjørn Mork
@ 2010-12-08 14:19               ` Lukas Kolbe
  1 sibling, 0 replies; 38+ messages in thread
From: Lukas Kolbe @ 2010-12-08 14:19 UTC (permalink / raw)
  To: Kai Makisara; +Cc: linux-scsi, FUJITA Tomonori

On Mon, 2010-12-06 at 10:36 +0100, Lukas Kolbe wrote:
> On Mon, 2010-12-06 at 09:59 +0200, Kai Makisara wrote:
> > On Thu, 2 Dec 2010, Kai Makisara wrote:
> 
> > OK, today I have had some time (a national holiday). I think I have tracked 
> > down this problem. The patch at the end should fix it. Basically, 
> > normalize_buffer() needs to know the order of the pages in order to 
> > properly free them and update the buffer size. When allocation 
> > failed, the order had not yet been stored in the tape buffer definition. 
> > This explains the problem seen after a failed allocation.
> 
> Thank you, this does indeed look nasty (not an easily caught bug).
> 
> > Why allocation failed is another problem which has been discussed elsewhere.
> 
> I just discovered tonight that Debian's .config has
> CONFIG_FUSION_MAX_SGE=40. Right now I'm testing with the mptsas driver's
> maximum of 128 to see whether I can reproduce the failure under memory
> pressure again. Sorry for not catching this one earlier! After reading
> the source, I had assumed that it already was 128 ...

Debian has since switched to the new default value, and after a few
days of testing with CONFIG_FUSION_MAX_SGE=128 and your patch I haven't
been able to reproduce the failure!

Kind regards,
Lukas




* Re: After memory pressure: can't read from tape anymore
  2010-12-03 18:10           ` James Bottomley
  2010-12-05 10:53             ` Lukas Kolbe
@ 2010-12-14 20:35             ` Vladislav Bolkhovitin
  2010-12-14 22:23               ` Stephen Hemminger
  1 sibling, 1 reply; 38+ messages in thread
From: Vladislav Bolkhovitin @ 2010-12-14 20:35 UTC (permalink / raw)
  To: James Bottomley
  Cc: Lukas Kolbe, Kai Mäkisara, FUJITA Tomonori, linux-scsi,
	Kashyap Desai, netdev

James Bottomley, on 12/03/2010 09:10 PM wrote:
>>>> Thanks for noticing this bug. I hope this helps the users. The question 
>>>> about the number of s/g segments is still valid for the direct i/o case, but 
>>>> that is an optimization, not a matter of whether one can read/write at all.
>>>
>>> Realistically, though, this will only increase the probability of making
>>> an allocation work, we can't get this to a certainty.
>>>
>>> Since we fixed up the infrastructure to allow arbitrary length sg lists,
>>> perhaps we should document what cards can actually take advantage of
>>> this (and how to do so, since it's not set automatically on boot).  That
>>> way users wanting tapes at least know what the problems are likely to be
>>> and how to avoid them in their hardware purchasing decisions. The
>>> corollary is that we should likely have a list of not recommended cards:
>>> if they can't go over 128 SG elements, then they're pretty much
>>> unsuitable for modern tapes.
>>
>> Are you implying here that the LSI SAS1068E is unsuitable to drive two
>> LTO-4 tape drives? Or is it 'just' a problem with the driver?
> 
> The information seems to be the former.  There's no way the kernel can
> guarantee physical contiguity of memory as it operates.  We try to
> defrag, but it's probabilistic, not certain, so if we have to try to
> find a physically contiguous buffer to copy into for an operation like
> this, at some point that allocation is going to fail.

What is interesting to me in this regard is how networking with 9K jumbo
frames manages to work acceptably reliably. Jumbo frames are used fairly
often, including under high memory pressure.

I'm not a deep networking guru, but network drivers need to allocate
physically contiguous memory for skbs, which means 16K per 9K packet,
which means order-2 allocations per skb.

I guess it works reliably because for networking it is OK to drop an
incoming packet and retry the allocation for the next one later.

If so, maybe it is similarly worth not returning an allocation error
immediately here, but retrying several times at intervals of a few
seconds?

Usually tape read/write operations have pretty big timeouts, like 60
seconds. Within that time it is possible to retry 10 times with 5
seconds between retries.

Vlad

> The only way to be certain you can get a 2MB block down to a tape device
> is to be able to transmit the whole thing as a SG list of fully
> discontiguous pages.  On a system with 4k pages, that requires 512 SG
> entries.  From what I've heard Kashyap say, that can't currently be done
> on the 1068 because of firmware limitations (I'm not entirely clear on
> this, but that's how it sounds to me ... if there is a way of making
> firmware accept more than 128 SG elements per SCSI command, then it is a
> fairly simple driver change).  This isn't something we can work around
> in the driver because the transaction can't be split ... it has to go
> down as a single WRITE command with a single output data buffer.
> 
> The LSI 1068 is an upgradeable firmware system, so it's always possible
> LSI can come up with a firmware update that increases the size (this
> would also require a corresponding driver change), but it doesn't sound
> to be something that can be done in the driver alone.
> 
> James


* Re: After memory pressure: can't read from tape anymore
  2010-12-14 20:35             ` Vladislav Bolkhovitin
@ 2010-12-14 22:23               ` Stephen Hemminger
  2010-12-15 16:27                 ` Vladislav Bolkhovitin
  0 siblings, 1 reply; 38+ messages in thread
From: Stephen Hemminger @ 2010-12-14 22:23 UTC (permalink / raw)
  To: Vladislav Bolkhovitin
  Cc: James Bottomley, Lukas Kolbe, Kai Mäkisara, FUJITA Tomonori,
	linux-scsi, Kashyap Desai, netdev

On Tue, 14 Dec 2010 23:35:37 +0300
Vladislav Bolkhovitin <vst@vlnb.net> wrote:

> What is interesting to me in this regard is how networking with 9K jumbo
> frames manages to work acceptably reliably. Jumbo frames are used fairly
> often, including under high memory pressure.
> 
> I'm not a deep networking guru, but network drivers need to allocate
> physically contiguous memory for skbs, which means 16K per 9K packet,
> which means order-2 allocations per skb.

Good network drivers support fragmentation: they allocate a small portion
for the header and pages for the rest. This requires no higher-order
allocation. The networking stack takes fragmented data coming in and
does the necessary copying/merging to access contiguous headers.

There are still some crap network drivers that require large contiguous
allocation. These should not be used with jumbo frames in real
environments.

-- 


* Re: After memory pressure: can't read from tape anymore
  2010-12-14 22:23               ` Stephen Hemminger
@ 2010-12-15 16:27                 ` Vladislav Bolkhovitin
  0 siblings, 0 replies; 38+ messages in thread
From: Vladislav Bolkhovitin @ 2010-12-15 16:27 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: James Bottomley, Lukas Kolbe, Kai Mäkisara, FUJITA Tomonori,
	linux-scsi, Kashyap Desai, netdev

Stephen Hemminger, on 12/15/2010 01:23 AM wrote:
> On Tue, 14 Dec 2010 23:35:37 +0300
> Vladislav Bolkhovitin <vst@vlnb.net> wrote:
> 
>> What is interesting to me in this regard is how networking with 9K jumbo
>> frames manages to work acceptably reliably. Jumbo frames are used fairly
>> often, including under high memory pressure.
>>
>> I'm not a deep networking guru, but network drivers need to allocate
>> physically contiguous memory for skbs, which means 16K per 9K packet,
>> which means order-2 allocations per skb.
> 
> Good network drivers support fragmentation: they allocate a small portion
> for the header and pages for the rest. This requires no higher-order
> allocation. The networking stack takes fragmented data coming in and
> does the necessary copying/merging to access contiguous headers.
> 
> There are still some crap network drivers that require large contiguous
> allocation. These should not be used with jumbo frames in real
> environments.

I see. Thanks for clarifying it.

Vlad





end of thread, other threads:[~2010-12-15 16:27 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-28 19:15 After memory pressure: can't read from tape anymore Lukas Kolbe
2010-11-29 17:09 ` Kai Makisara
2010-11-30 13:31   ` Lukas Kolbe
2010-11-30 16:10     ` Boaz Harrosh
2010-11-30 16:23       ` Kai Makisara
2010-11-30 16:44         ` Boaz Harrosh
2010-11-30 17:04           ` Kai Makisara
2010-11-30 17:24             ` Boaz Harrosh
2010-11-30 19:53               ` Kai Makisara
2010-12-01  9:40                 ` Lukas Kolbe
2010-12-02 11:17                   ` Desai, Kashyap
2010-12-02 16:22                     ` Kai Makisara
2010-12-02 18:14                       ` Desai, Kashyap
2010-12-02 20:25                         ` Kai Makisara
2010-12-05 10:44                           ` Lukas Kolbe
2010-12-03 10:13                       ` FUJITA Tomonori
2010-12-03 10:45                         ` Desai, Kashyap
2010-12-03 11:11                           ` FUJITA Tomonori
2010-12-02 10:01                 ` Lukas Kolbe
2010-12-03  9:44               ` FUJITA Tomonori
2010-11-30 16:20     ` Kai Makisara
2010-12-01 17:06       ` Lukas Kolbe
2010-12-02 16:41         ` Kai Makisara
2010-12-06  7:59           ` Kai Makisara
2010-12-06  8:50             ` FUJITA Tomonori
2010-12-06  9:36             ` Lukas Kolbe
2010-12-06 11:34               ` Bjørn Mork
2010-12-08 14:19               ` Lukas Kolbe
2010-12-03 12:27   ` FUJITA Tomonori
2010-12-03 14:59     ` Kai Mäkisara
2010-12-03 15:06       ` James Bottomley
2010-12-03 17:03         ` Lukas Kolbe
2010-12-03 18:10           ` James Bottomley
2010-12-05 10:53             ` Lukas Kolbe
2010-12-05 12:16               ` FUJITA Tomonori
2010-12-14 20:35             ` Vladislav Bolkhovitin
2010-12-14 22:23               ` Stephen Hemminger
2010-12-15 16:27                 ` Vladislav Bolkhovitin
