* After memory pressure: can't read from tape anymore
@ 2010-11-28 19:15 Lukas Kolbe
  2010-11-29 17:09 ` Kai Makisara
  0 siblings, 1 reply; 38+ messages in thread
From: Lukas Kolbe @ 2010-11-28 19:15 UTC (permalink / raw)
To: linux-scsi

Hi,

On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
debian/squeeze), we see reproducible tape read and write failures after
the system was under memory pressure:

[342567.297152] st0: Can't allocate 2097152 byte tape buffer.
[342569.316099] st0: Can't allocate 2097152 byte tape buffer.
[342570.805164] st0: Can't allocate 2097152 byte tape buffer.
[342571.958331] st0: Can't allocate 2097152 byte tape buffer.
[342572.704264] st0: Can't allocate 2097152 byte tape buffer.
[342873.737130] st: from_buffer offset overflow.

Bacula is spewing this message every time it tries to access the tape
drive:

28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0). ERR=Input/output error

By memory pressure, I mean that the KVM processes containing the
postgres-db (~20 million files) and the bacula director have used all
available RAM; one of them used ~4 GiB of its 12 GiB swap for an hour or
so (by selecting a full restore, it seems that the whole directory tree
of the 15-million-file backup gets read into memory). After this, I
wasn't able to read from the second tape drive anymore (/dev/st0),
whereas the first tape drive was restoring its data happily (it is
currently about halfway through a 3 TiB restore from 5 tapes).

The same behaviour appears when we're doing a few incremental backups:
after a while, it just isn't possible to use the tape drives anymore -
every I/O operation gives an I/O error, even a simple dd bs=64k
count=10. After a restart, the system behaves correctly until
-seemingly- another memory pressure situation occurs.

I'd be delighted if somebody could help me debug this; my systemtap
skills are non-existent, unfortunately.

Kind regards,
Lukas Kolbe

^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore
  2010-11-28 19:15 After memory pressure: can't read from tape anymore Lukas Kolbe
@ 2010-11-29 17:09 ` Kai Makisara
  2010-11-30 13:31   ` Lukas Kolbe
  2010-12-03 12:27   ` FUJITA Tomonori
  0 siblings, 2 replies; 38+ messages in thread
From: Kai Makisara @ 2010-11-29 17:09 UTC (permalink / raw)
To: Lukas Kolbe; +Cc: linux-scsi

On Sun, 28 Nov 2010, Lukas Kolbe wrote:

> Hi,
>
> On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
> Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
> debian/squeeze), we see reproducible tape read and write failures after
> the system was under memory pressure:
>
> [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
> [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
> [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
> [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
> [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
> [342873.737130] st: from_buffer offset overflow.
>
> Bacula is spewing this message every time it tries to access the tape
> drive:
> 28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0). ERR=Input/output error
>
> By memory pressure, I mean that the KVM processes containing the
> postgres-db (~20 million files) and the bacula director have used all
> available RAM; one of them used ~4 GiB of its 12 GiB swap for an hour or
> so (by selecting a full restore, it seems that the whole directory tree
> of the 15-million-file backup gets read into memory). After this, I
> wasn't able to read from the second tape drive anymore (/dev/st0),
> whereas the first tape drive was restoring its data happily (it is
> currently about halfway through a 3 TiB restore from 5 tapes).
>
> The same behaviour appears when we're doing a few incremental backups:
> after a while, it just isn't possible to use the tape drives anymore -
> every I/O operation gives an I/O error, even a simple dd bs=64k
> count=10. After a restart, the system behaves correctly until
> -seemingly- another memory pressure situation occurs.
>
This is predictable. The maximum number of scatter/gather segments seems
to be 128. The st driver first tries to set up the transfer directly
from the user buffer to the HBA. The user buffer is usually fragmented,
so one scatter/gather segment is used for each page. Assuming a 4 kB
page size, the maximum size of the direct transfer is 128 x 4 kB =
512 kB.

When this fails, the driver tries to allocate a kernel buffer built from
physically contiguous segments larger than 4 kB. Let's assume that it
can find 128 16 kB segments. In this case the maximum block size is
2048 kB. Memory pressure results in memory fragmentation, the driver
can't find large enough segments, and allocation fails. This is what you
are seeing.

So, one solution is to use a 512 kB block size. Another one is to try to
find out whether the 128-segment limit is a physical limitation or just
a choice. In the latter case the mptsas driver could be modified to
support larger block sizes even after memory fragmentation.

Kai
* Re: After memory pressure: can't read from tape anymore
  2010-11-29 17:09 ` Kai Makisara
@ 2010-11-30 13:31   ` Lukas Kolbe
  2010-11-30 16:10     ` Boaz Harrosh
  2010-11-30 16:20     ` Kai Makisara
  2010-12-03 12:27   ` FUJITA Tomonori
  1 sibling, 2 replies; 38+ messages in thread
From: Lukas Kolbe @ 2010-11-30 13:31 UTC (permalink / raw)
To: Kai Makisara; +Cc: linux-scsi

On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:

Hi,

> > On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
> > Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
> > debian/squeeze), we see reproducible tape read and write failures after
> > the system was under memory pressure:
> >
> > [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
> > [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
> > [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
> > [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
> > [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
> > [342873.737130] st: from_buffer offset overflow.
> >
> > Bacula is spewing this message every time it tries to access the tape
> > drive:
> > 28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0). ERR=Input/output error
> >
> > By memory pressure, I mean that the KVM processes containing the
> > postgres-db (~20 million files) and the bacula director have used all
> > available RAM; one of them used ~4 GiB of its 12 GiB swap for an hour
> > or so (by selecting a full restore, it seems that the whole directory
> > tree of the 15-million-file backup gets read into memory). After this,
> > I wasn't able to read from the second tape drive anymore (/dev/st0),
> > whereas the first tape drive was restoring its data happily (it is
> > currently about halfway through a 3 TiB restore from 5 tapes).
> >
> > The same behaviour appears when we're doing a few incremental backups:
> > after a while, it just isn't possible to use the tape drives anymore -
> > every I/O operation gives an I/O error, even a simple dd bs=64k
> > count=10. After a restart, the system behaves correctly until
> > -seemingly- another memory pressure situation occurs.
> >
> This is predictable. The maximum number of scatter/gather segments seems
> to be 128. The st driver first tries to set up the transfer directly
> from the user buffer to the HBA. The user buffer is usually fragmented,
> so one scatter/gather segment is used for each page. Assuming a 4 kB
> page size, the maximum size of the direct transfer is 128 x 4 kB =
> 512 kB.
>
> When this fails, the driver tries to allocate a kernel buffer built from
> physically contiguous segments larger than 4 kB. Let's assume that it
> can find 128 16 kB segments. In this case the maximum block size is
> 2048 kB. Memory pressure results in memory fragmentation, the driver
> can't find large enough segments, and allocation fails. This is what you
> are seeing.

Reasonable explanation, thanks. What makes me wonder is why it still
fails *after* the memory pressure is gone - i.e. free shows more than
4 GiB of free memory. I had the output of /proc/meminfo at that time but
can't find it anymore :/

> So, one solution is to use a 512 kB block size. Another one is to try to
> find out whether the 128-segment limit is a physical limitation or just
> a choice. In the latter case the mptsas driver could be modified to
> support larger block sizes even after memory fragmentation.

Even with a 64 kB blocksize (dd bs=64k), I was getting I/O errors trying
to access the tape drive. I am now trying to raise the max_sg_segs
parameter of the st module (modinfo says 256 is the default; I'm trying
1024 now) to see how well this works under memory pressure.

> Kai

-- 
Lukas
* Re: After memory pressure: can't read from tape anymore
  2010-11-30 13:31 ` Lukas Kolbe
@ 2010-11-30 16:10   ` Boaz Harrosh
  2010-11-30 16:23     ` Kai Makisara
  2010-11-30 16:20   ` Kai Makisara
  1 sibling, 1 reply; 38+ messages in thread
From: Boaz Harrosh @ 2010-11-30 16:10 UTC (permalink / raw)
To: Lukas Kolbe; +Cc: Kai Makisara, linux-scsi

On 11/30/2010 03:31 PM, Lukas Kolbe wrote:
> On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
>
> Hi,
>
>>> On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
>>> Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
>>> debian/squeeze), we see reproducible tape read and write failures after
>>> the system was under memory pressure:
>>>
>>> [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
>>> [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
>>> [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
>>> [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
>>> [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
>>> [342873.737130] st: from_buffer offset overflow.
>>>
>>> Bacula is spewing this message every time it tries to access the tape
>>> drive:
>>> 28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0). ERR=Input/output error
>>>
>>> By memory pressure, I mean that the KVM processes containing the
>>> postgres-db (~20 million files) and the bacula director have used all
>>> available RAM; one of them used ~4 GiB of its 12 GiB swap for an hour
>>> or so (by selecting a full restore, it seems that the whole directory
>>> tree of the 15-million-file backup gets read into memory). After this,
>>> I wasn't able to read from the second tape drive anymore (/dev/st0),
>>> whereas the first tape drive was restoring its data happily (it is
>>> currently about halfway through a 3 TiB restore from 5 tapes).
>>>
>>> The same behaviour appears when we're doing a few incremental backups:
>>> after a while, it just isn't possible to use the tape drives anymore -
>>> every I/O operation gives an I/O error, even a simple dd bs=64k
>>> count=10. After a restart, the system behaves correctly until
>>> -seemingly- another memory pressure situation occurs.
>>>
>> This is predictable. The maximum number of scatter/gather segments seems
>> to be 128. The st driver first tries to set up the transfer directly
>> from the user buffer to the HBA. The user buffer is usually fragmented,
>> so one scatter/gather segment is used for each page. Assuming a 4 kB
>> page size, the maximum size of the direct transfer is 128 x 4 kB =
>> 512 kB.
>>
>> When this fails, the driver tries to allocate a kernel buffer built from
>> physically contiguous segments larger than 4 kB. Let's assume that it
>> can find 128 16 kB segments. In this case the maximum block size is
>> 2048 kB. Memory pressure results in memory fragmentation, the driver
>> can't find large enough segments, and allocation fails. This is what you
>> are seeing.
>
> Reasonable explanation, thanks. What makes me wonder is why it still
> fails *after* the memory pressure is gone - i.e. free shows more than
> 4 GiB of free memory. I had the output of /proc/meminfo at that time but
> can't find it anymore :/
>
>> So, one solution is to use a 512 kB block size. Another one is to try to
>> find out whether the 128-segment limit is a physical limitation or just
>> a choice. In the latter case the mptsas driver could be modified to
>> support larger block sizes even after memory fragmentation.
>
> Even with a 64 kB blocksize (dd bs=64k), I was getting I/O errors trying
> to access the tape drive. I am now trying to raise the max_sg_segs
> parameter of the st module (modinfo says 256 is the default; I'm trying
> 1024 now) to see how well this works under memory pressure.
>

It looks like something is broken/old-code in sr. Most important LLDs
and the block layer/scsi-ml fully support sg chaining, which can
effectively deliver limitless (only HW-limited) sg sizes. It looks like
sr has some code that tries to allocate contiguous buffers larger than
PAGE_SIZE. Why does it do that? It should not be necessary any more.

>> Kai

Boaz
* Re: After memory pressure: can't read from tape anymore
  2010-11-30 16:10 ` Boaz Harrosh
@ 2010-11-30 16:23   ` Kai Makisara
  2010-11-30 16:44     ` Boaz Harrosh
  0 siblings, 1 reply; 38+ messages in thread
From: Kai Makisara @ 2010-11-30 16:23 UTC (permalink / raw)
To: Boaz Harrosh; +Cc: Lukas Kolbe, linux-scsi

On Tue, 30 Nov 2010, Boaz Harrosh wrote:

> On 11/30/2010 03:31 PM, Lukas Kolbe wrote:
> > On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
> >
....
> It looks like something is broken/old-code in sr. Most important LLDs
> and the block layer/scsi-ml fully support sg chaining, which can
> effectively deliver limitless (only HW-limited) sg sizes. It looks like
> sr has some code that tries to allocate contiguous buffers larger than
> PAGE_SIZE. Why does it do that? It should not be necessary any more.
>
The relevant driver is st, and it uses sg chaining when necessary. I
tried to explain that the effective limit in this case comes from
mptsas. I don't know whether it is a HW limit or a driver limit.

Kai
* Re: After memory pressure: can't read from tape anymore
  2010-11-30 16:23 ` Kai Makisara
@ 2010-11-30 16:44   ` Boaz Harrosh
  2010-11-30 17:04     ` Kai Makisara
  0 siblings, 1 reply; 38+ messages in thread
From: Boaz Harrosh @ 2010-11-30 16:44 UTC (permalink / raw)
To: Kai Makisara; +Cc: Lukas Kolbe, linux-scsi

On 11/30/2010 06:23 PM, Kai Makisara wrote:
> On Tue, 30 Nov 2010, Boaz Harrosh wrote:
>
>> On 11/30/2010 03:31 PM, Lukas Kolbe wrote:
>>> On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
>>>
> ....
>> It looks like something is broken/old-code in sr. Most important LLDs
>> and the block layer/scsi-ml fully support sg chaining, which can
>> effectively deliver limitless (only HW-limited) sg sizes. It looks like
>> sr has some code that tries to allocate contiguous buffers larger than
>> PAGE_SIZE. Why does it do that? It should not be necessary any more.
>>
> The relevant driver is st

Sorry, I meant st, yes.

> and it uses sg chaining when necessary. I tried
> to explain that the effective limit in this case comes from mptsas. I
> don't know whether it is a HW limit or a driver limit.
>

Then I don't understand where the failing allocation is. Where in the
code path is anyone trying to allocate something bigger than a page?
Please explain.

> Kai
>

Thanks
Boaz
* Re: After memory pressure: can't read from tape anymore
  2010-11-30 16:44 ` Boaz Harrosh
@ 2010-11-30 17:04   ` Kai Makisara
  2010-11-30 17:24     ` Boaz Harrosh
  0 siblings, 1 reply; 38+ messages in thread
From: Kai Makisara @ 2010-11-30 17:04 UTC (permalink / raw)
To: Boaz Harrosh; +Cc: Lukas Kolbe, linux-scsi

On Tue, 30 Nov 2010, Boaz Harrosh wrote:

> On 11/30/2010 06:23 PM, Kai Makisara wrote:
> > On Tue, 30 Nov 2010, Boaz Harrosh wrote:
> >
> >> On 11/30/2010 03:31 PM, Lukas Kolbe wrote:
> >>> On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
> >>>
> > ....
> >> It looks like something is broken/old-code in sr. Most important LLDs
> >> and the block layer/scsi-ml fully support sg chaining, which can
> >> effectively deliver limitless (only HW-limited) sg sizes. It looks like
> >> sr has some code that tries to allocate contiguous buffers larger than
> >> PAGE_SIZE. Why does it do that? It should not be necessary any more.
> >>
> > The relevant driver is st
>
> Sorry, I meant st, yes.
>
> > and it uses sg chaining when necessary. I tried
> > to explain that the effective limit in this case comes from mptsas. I
> > don't know whether it is a HW limit or a driver limit.
> >
> Then I don't understand where the failing allocation is. Where in the
> code path is anyone trying to allocate something bigger than a page?
> Please explain.
>
The function enlarge_buffer() in st.c tries to allocate a driver buffer
that is large enough for one block such that the number of contiguous
memory chunks does not exceed the allowed maximum. Allocation is done
using alloc_pages() (at line 3744 in st.c in 2.6.36), usually with
order > 0.

Kai
* Re: After memory pressure: can't read from tape anymore
  2010-11-30 17:04 ` Kai Makisara
@ 2010-11-30 17:24   ` Boaz Harrosh
  2010-11-30 19:53     ` Kai Makisara
  2010-12-03  9:44     ` FUJITA Tomonori
  0 siblings, 2 replies; 38+ messages in thread
From: Boaz Harrosh @ 2010-11-30 17:24 UTC (permalink / raw)
To: Kai Makisara; +Cc: Lukas Kolbe, linux-scsi

On 11/30/2010 07:04 PM, Kai Makisara wrote:
> On Tue, 30 Nov 2010, Boaz Harrosh wrote:
>
>> On 11/30/2010 06:23 PM, Kai Makisara wrote:
>>> On Tue, 30 Nov 2010, Boaz Harrosh wrote:
>>>
>>>> On 11/30/2010 03:31 PM, Lukas Kolbe wrote:
>>>>> On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote:
>>>>>
>>> ....
>>>> It looks like something is broken/old-code in sr. Most important LLDs
>>>> and the block layer/scsi-ml fully support sg chaining, which can
>>>> effectively deliver limitless (only HW-limited) sg sizes. It looks like
>>>> sr has some code that tries to allocate contiguous buffers larger than
>>>> PAGE_SIZE. Why does it do that? It should not be necessary any more.
>>>>
>>> The relevant driver is st
>>
>> Sorry, I meant st, yes.
>>
>>> and it uses sg chaining when necessary. I tried
>>> to explain that the effective limit in this case comes from mptsas. I
>>> don't know whether it is a HW limit or a driver limit.
>>>
>> Then I don't understand where the failing allocation is. Where in the
>> code path is anyone trying to allocate something bigger than a page?
>> Please explain.
>>
> The function enlarge_buffer() in st.c tries to allocate a driver buffer
> that is large enough for one block such that the number of contiguous
> memory chunks does not exceed the allowed maximum. Allocation is done
> using alloc_pages() (at line 3744 in st.c in 2.6.36), usually with
> order > 0.
>

I looked at enlarge_buffer() and it looks fragile and broken. If you
really need a pointer, e.g.:
  STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);
then why not use vmalloc() for buffers larger than PAGE_SIZE? But better
yet, avoid it by keeping a pages array or an sg-list and operating in an
aio style.

> Kai

But I understand this is a lot of work on an old driver. Perhaps
pre-allocate something big at startup, specified by the user?

Thanks
Boaz
* Re: After memory pressure: can't read from tape anymore
  2010-11-30 17:24 ` Boaz Harrosh
@ 2010-11-30 19:53   ` Kai Makisara
  2010-12-01  9:40     ` Lukas Kolbe
  2010-12-02 10:01     ` Lukas Kolbe
  1 sibling, 2 replies; 38+ messages in thread
From: Kai Makisara @ 2010-11-30 19:53 UTC (permalink / raw)
To: Boaz Harrosh; +Cc: Lukas Kolbe, linux-scsi

On Tue, 30 Nov 2010, Boaz Harrosh wrote:
...
> I looked at enlarge_buffer() and it looks fragile and broken. If you
> really need a pointer, e.g.:
>   STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);
>
If you think it is broken, please fix it.

> then why not use vmalloc() for buffers larger than PAGE_SIZE? But better
> yet, avoid it by keeping a pages array or an sg-list and operating in an
> aio style.
>
vmalloc() is not a solution here. Think about this from the HBA side.
Each s/g segment must be contiguous in the address space the HBA uses.
In many cases this is the physical memory address space. Any solution
must make sure that the HBA can perform the requested data transfer.

> > Kai
>
> But I understand this is a lot of work on an old driver. Perhaps
> pre-allocate something big at startup, specified by the user?
>
This used to be possible at some time, and it could be made possible
again. But I don't like this option because it means that the users must
explicitly set the boot parameters.

And it is difficult for me to believe that modern SAS HBAs only support
128 s/g segments.

Kai
* Re: After memory pressure: can't read from tape anymore
  2010-11-30 19:53 ` Kai Makisara
@ 2010-12-01  9:40   ` Lukas Kolbe
  2010-12-02 11:17     ` Desai, Kashyap
  1 sibling, 1 reply; 38+ messages in thread
From: Lukas Kolbe @ 2010-12-01 9:40 UTC (permalink / raw)
To: Kai Makisara; +Cc: Boaz Harrosh, linux-scsi, Kashyap Desai

On Tue, 2010-11-30 at 21:53 +0200, Kai Makisara wrote:
> On Tue, 30 Nov 2010, Boaz Harrosh wrote:

I'm Cc'ing Kashyap Desai from LSI; maybe he can comment on the hardware
limitations of the SAS1068E?

> ...
> > I looked at enlarge_buffer() and it looks fragile and broken. If you
> > really need a pointer, e.g.:
> >   STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);
> >
> If you think it is broken, please fix it.
>
> > then why not use vmalloc() for buffers larger than PAGE_SIZE? But
> > better yet, avoid it by keeping a pages array or an sg-list and
> > operating in an aio style.
> >
> vmalloc() is not a solution here. Think about this from the HBA side.
> Each s/g segment must be contiguous in the address space the HBA uses.
> In many cases this is the physical memory address space. Any solution
> must make sure that the HBA can perform the requested data transfer.
>
> > > Kai
> >
> > But I understand this is a lot of work on an old driver. Perhaps
> > pre-allocate something big at startup, specified by the user?
> >
> This used to be possible at some time, and it could be made possible
> again. But I don't like this option because it means that the users must
> explicitly set the boot parameters.
>
> And it is difficult for me to believe that modern SAS HBAs only support
> 128 s/g segments.
>
> Kai

For reference, here's my original message with Kai's reply:

> Hi,
>
> On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
> Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
> debian/squeeze), we see reproducible tape read and write failures after
> the system was under memory pressure:
>
> [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
> [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
> [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
> [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
> [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
> [342873.737130] st: from_buffer offset overflow.
>
> Bacula is spewing this message every time it tries to access the tape
> drive:
> 28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0). ERR=Input/output error
>
> By memory pressure, I mean that the KVM processes containing the
> postgres-db (~20 million files) and the bacula director have used all
> available RAM; one of them used ~4 GiB of its 12 GiB swap for an hour
> or so (by selecting a full restore, it seems that the whole directory
> tree of the 15-million-file backup gets read into memory). After this,
> I wasn't able to read from the second tape drive anymore (/dev/st0),
> whereas the first tape drive was restoring its data happily (it is
> currently about halfway through a 3 TiB restore from 5 tapes).
>
> The same behaviour appears when we're doing a few incremental backups:
> after a while, it just isn't possible to use the tape drives anymore -
> every I/O operation gives an I/O error, even a simple dd bs=64k
> count=10. After a restart, the system behaves correctly until
> -seemingly- another memory pressure situation occurs.
>
> This is predictable. The maximum number of scatter/gather segments
> seems to be 128. The st driver first tries to set up the transfer
> directly from the user buffer to the HBA. The user buffer is usually
> fragmented, so one scatter/gather segment is used for each page.
> Assuming a 4 kB page size, the maximum size of the direct transfer is
> 128 x 4 kB = 512 kB.
>
> When this fails, the driver tries to allocate a kernel buffer built
> from physically contiguous segments larger than 4 kB. Let's assume
> that it can find 128 16 kB segments. In this case the maximum block
> size is 2048 kB. Memory pressure results in memory fragmentation, the
> driver can't find large enough segments, and allocation fails. This is
> what you are seeing.
>
> So, one solution is to use a 512 kB block size. Another one is to try
> to find out whether the 128-segment limit is a physical limitation or
> just a choice. In the latter case the mptsas driver could be modified
> to support larger block sizes even after memory fragmentation.
>
> Kai
* RE: After memory pressure: can't read from tape anymore
  2010-12-01  9:40 ` Lukas Kolbe
@ 2010-12-02 11:17   ` Desai, Kashyap
  2010-12-02 16:22     ` Kai Makisara
  0 siblings, 1 reply; 38+ messages in thread
From: Desai, Kashyap @ 2010-12-02 11:17 UTC (permalink / raw)
To: Lukas Kolbe, Kai Makisara; +Cc: Boaz Harrosh, linux-scsi

> -----Original Message-----
> From: Lukas Kolbe [mailto:lkolbe@techfak.uni-bielefeld.de]
> Sent: Wednesday, December 01, 2010 3:10 PM
> To: Kai Makisara
> Cc: Boaz Harrosh; linux-scsi@vger.kernel.org; Desai, Kashyap
> Subject: Re: After memory pressure: can't read from tape anymore
>
> On Tue, 2010-11-30 at 21:53 +0200, Kai Makisara wrote:
> > On Tue, 30 Nov 2010, Boaz Harrosh wrote:
>
> I'm Cc'ing Kashyap Desai from LSI; maybe he can comment on the hardware
> limitations of the SAS1068E?

Lukas,

No, it is not a limitation of the h/w that "CONFIG_FUSION_MAX_SGE" needs
to be 128. But our code is written in such a way that even if you change
it to more than 128, it will fall back to 128 again.

To change this value you need to make the change below in mptbase.h:

--
-#define MPT_SCSI_SG_DEPTH CONFIG_FUSION_MAX_SGE
+#define MPT_SCSI_SG_DEPTH 256
--

128 is a good amount for scatter/gather elements. This value has been
the standard value for MPT FUSION for a long time.

This value is reflected in sg_tablesize, and the Linux scatter-gather
code will use this value when creating the sg_table for the HBA.
See: "cat /sys/class/scsi_host/host<x>/sg_tablesize"

If a single IO does not fit into sg_tablesize, it will be converted into
multiple IOs for the low-level driver (by the "scatter-gather" code of
Linux). So I do not see any problem with the CONFIG_FUSION_MAX_SGE
value. Our driver internally converts the sglist into SGEs understood by
the LSI H/W.

Thanks,
Kashyap

> > ...
> > > I looked at enlarge_buffer() and it looks fragile and broken. If
> > > you really need a pointer, e.g.:
> > >   STbuffer->b_data = page_address(STbuffer->reserved_pages[0]);
> > >
> > If you think it is broken, please fix it.
> >
> > > then why not use vmalloc() for buffers larger than PAGE_SIZE? But
> > > better yet, avoid it by keeping a pages array or an sg-list and
> > > operating in an aio style.
> > >
> > vmalloc() is not a solution here. Think about this from the HBA side.
> > Each s/g segment must be contiguous in the address space the HBA
> > uses. In many cases this is the physical memory address space. Any
> > solution must make sure that the HBA can perform the requested data
> > transfer.
> >
> > > > Kai
> > >
> > > But I understand this is a lot of work on an old driver. Perhaps
> > > pre-allocate something big at startup, specified by the user?
> > >
> > This used to be possible at some time, and it could be made possible
> > again. But I don't like this option because it means that the users
> > must explicitly set the boot parameters.
> >
> > And it is difficult for me to believe that modern SAS HBAs only
> > support 128 s/g segments.
> >
> > Kai
>
> For reference, here's my original message with Kai's reply:
>
> > Hi,
> >
> > On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E,
> > Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on
> > debian/squeeze), we see reproducible tape read and write failures
> > after the system was under memory pressure:
> >
> > [342567.297152] st0: Can't allocate 2097152 byte tape buffer.
> > [342569.316099] st0: Can't allocate 2097152 byte tape buffer.
> > [342570.805164] st0: Can't allocate 2097152 byte tape buffer.
> > [342571.958331] st0: Can't allocate 2097152 byte tape buffer.
> > [342572.704264] st0: Can't allocate 2097152 byte tape buffer.
> > [342873.737130] st: from_buffer offset overflow.
> >
> > Bacula is spewing this message every time it tries to access the tape
> > drive:
> > 28-Nov 19:58 sd1.techfak JobId 2857: Error: block.c:1002 Read error
> > on fd=10 at file:blk 0:0 on device "drv2" (/dev/nst0).
> > ERR=Input/output error
> >
> > By memory pressure, I mean that the KVM processes containing the
> > postgres-db (~20 million files) and the bacula director have used all
> > available RAM; one of them used ~4 GiB of its 12 GiB swap for an hour
> > or so (by selecting a full restore, it seems that the whole directory
> > tree of the 15-million-file backup gets read into memory). After
> > this, I wasn't able to read from the second tape drive anymore
> > (/dev/st0), whereas the first tape drive was restoring its data
> > happily (it is currently about halfway through a 3 TiB restore from
> > 5 tapes).
> >
> > The same behaviour appears when we're doing a few incremental
> > backups: after a while, it just isn't possible to use the tape drives
> > anymore - every I/O operation gives an I/O error, even a simple
> > dd bs=64k count=10. After a restart, the system behaves correctly
> > until -seemingly- another memory pressure situation occurs.
> >
> > This is predictable. The maximum number of scatter/gather segments
> > seems to be 128. The st driver first tries to set up the transfer
> > directly from the user buffer to the HBA. The user buffer is usually
> > fragmented, so one scatter/gather segment is used for each page.
> > Assuming a 4 kB page size, the maximum size of the direct transfer is
> > 128 x 4 kB = 512 kB.
> >
> > When this fails, the driver tries to allocate a kernel buffer built
> > from physically contiguous segments larger than 4 kB. Let's assume
> > that it can find 128 16 kB segments. In this case the maximum block
> > size is 2048 kB. Memory pressure results in memory fragmentation, the
> > driver can't find large enough segments, and allocation fails. This
> > is what you are seeing.
> >
> > So, one solution is to use a 512 kB block size. Another one is to try
> > to find out whether the 128-segment limit is a physical limitation or
> > just a choice. In the latter case the mptsas driver could be modified
> > to support larger block sizes even after memory fragmentation.
> >
> > Kai
* RE: After memory pressure: can't read from tape anymore
  2010-12-02 11:17 ` Desai, Kashyap
@ 2010-12-02 16:22   ` Kai Makisara
  2010-12-02 18:14     ` Desai, Kashyap
  2010-12-03 10:13     ` FUJITA Tomonori
  0 siblings, 2 replies; 38+ messages in thread
From: Kai Makisara @ 2010-12-02 16:22 UTC (permalink / raw)
To: Desai, Kashyap; +Cc: Lukas Kolbe, Boaz Harrosh, linux-scsi

On Thu, 2 Dec 2010, Desai, Kashyap wrote:

> > -----Original Message-----
> > From: Lukas Kolbe [mailto:lkolbe@techfak.uni-bielefeld.de]
> > Sent: Wednesday, December 01, 2010 3:10 PM
> > To: Kai Makisara
> > Cc: Boaz Harrosh; linux-scsi@vger.kernel.org; Desai, Kashyap
> > Subject: Re: After memory pressure: can't read from tape anymore
> >
> > On Tue, 2010-11-30 at 21:53 +0200, Kai Makisara wrote:
> > > On Tue, 30 Nov 2010, Boaz Harrosh wrote:
> >
> > I'm Cc'ing Kashyap Desai from LSI; maybe he can comment on the
> > hardware limitations of the SAS1068E?
>
> Lukas,
>
> No, it is not a limitation of the h/w that "CONFIG_FUSION_MAX_SGE"
> needs to be 128. But our code is written in such a way that even if you
> change it to more than 128, it will fall back to 128 again.
>
> To change this value you need to make the change below in mptbase.h:
>
> --
> -#define MPT_SCSI_SG_DEPTH CONFIG_FUSION_MAX_SGE
> +#define MPT_SCSI_SG_DEPTH 256
> --
>
> 128 is a good amount for scatter/gather elements. This value has been
> the standard value for MPT FUSION for a long time.
>
> This value is reflected in sg_tablesize, and the Linux scatter-gather
> code will use this value when creating the sg_table for the HBA.
> See: "cat /sys/class/scsi_host/host<x>/sg_tablesize"
>
> If a single IO does not fit into sg_tablesize, it will be converted
> into multiple IOs for the low-level driver (by the "scatter-gather"
> code of Linux). So I do not see any problem with the
> CONFIG_FUSION_MAX_SGE value. Our driver internally converts the sglist
> into SGEs understood by the LSI H/W.
>
You can't convert a write of one block into multiple IOs. If someone
wants to write 2 MB blocks, the system must transfer 2 MB in one IO.
The choices are:

1. Direct I/O from user space. With a 4 kB page size, this means that
513 s/g segments (512 if the user buffer is page aligned) must be used.
(Unless there is an IOMMU and an API for an ULD to use it.)

2. Use a bounce buffer. Dynamic allocation of the bounce buffer with a
small number of segments is problematic when the system has been running
for a long time, but the buffer can be allocated when the device is
detected. (This is problematic if the device is detected at run-time.)

I don't know which alternative is more efficient from the HW point of
view, if both are possible. Do you (or someone else) have an opinion on
this?

Kai
* RE: After memory pressure: can't read from tape anymore 2010-12-02 16:22 ` Kai Makisara @ 2010-12-02 18:14 ` Desai, Kashyap 2010-12-02 20:25 ` Kai Makisara 2010-12-03 10:13 ` FUJITA Tomonori 1 sibling, 1 reply; 38+ messages in thread From: Desai, Kashyap @ 2010-12-02 18:14 UTC (permalink / raw) To: Kai Makisara; +Cc: Lukas Kolbe, Boaz Harrosh, linux-scsi > -----Original Message----- > From: Kai Makisara [mailto:Kai.Makisara@kolumbus.fi] > Sent: Thursday, December 02, 2010 9:53 PM > To: Desai, Kashyap > Cc: Lukas Kolbe; Boaz Harrosh; linux-scsi@vger.kernel.org > Subject: RE: After memory pressure: can't read from tape anymore > > On Thu, 2 Dec 2010, Desai, Kashyap wrote: > > > > > > > > -----Original Message----- > > > From: Lukas Kolbe [mailto:lkolbe@techfak.uni-bielefeld.de] > > > Sent: Wednesday, December 01, 2010 3:10 PM > > > To: Kai Makisara > > > Cc: Boaz Harrosh; linux-scsi@vger.kernel.org; Desai, Kashyap > > > Subject: Re: After memory pressure: can't read from tape anymore > > > > > > Am Dienstag, den 30.11.2010, 21:53 +0200 schrieb Kai Makisara: > > > > On Tue, 30 Nov 2010, Boaz Harrosh wrote: > > > > > > I'm Cc'ing Desay Kashyap from LSI, maybe he can comment on the > hardware > > > limitations of the SAS1068E? > > Lukas, > > > > No. it is not limitation from h/w that " CONFIG_FUSION_MAX_SGE" needs > to be 128. > > But our code is written such a way that even if you change it more > than 128, it will fall down to 128 again. > > > > To change this value you need to do below changes in mptbase.h > > > > -- > > -#define MPT_SCSI_SG_DEPTH CONFIG_FUSION_MAX_SGE > > +#define MPT_SCSI_SG_DEPTH 256 > > -- > > > > 128 is good amount for Scatter gather element. This value is standard > value for MPT FUSIION, since long. > > > > This value will be reflect to sg_tablesize and linux scatter-gather > module will use this value for creating sg_table for HBA. 
> > See: " cat /sys/class/scsi_host/host<x>/sg_tablesize" > > > > If single IO is not able to fit into sg_tablesize, then it will be > converted into multiple IOs for Low Layer Drivers(By "scatter-gather" > module of linux). > > So I do not see any problem with > > CONFIG_FUSION_MAX_SGE value. Our driver internally convert sglist > into SGE which understood by LSI H/W. > > > You can't convert write of one block into multiple IOs. If someone > wants > to write 2 MB blocks, the system must transfer 2 MB in one IO. The > choices > are: I am not sure why single IO cannot be converted into multiple "IO" request. If you run below commands " sg_dd if=/dev/zero of=/dev/sdb bs=4800000 count=1" You will see multiple IOs(requests) are coming to low layer driver. > > 1. Direct I/O from user space. With 4 kB page size, this means that 513 > s/g segments (512 if the user buffer page is aligned) must be used. > (Unless there is an IOMMU and an API for an ULD to use it.) > > 2. Use a bounce buffer. Dynamic allocation of the bounce buffer with a > small number of segments is problematic when the system has been > running > for a long time but the buffer can be allocated when the device is > detected. (This is problematic if the device is detected at run-time.) > > I don't know which alternative is more efficient from the HW point of > view > if both are possible. Do you (or someone else) have any opinion on > this? Regarding above things, I am not sure too, why this has to be part of Low level driver? It has to be done my block layer.... > > Kai ^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: After memory pressure: can't read from tape anymore 2010-12-02 18:14 ` Desai, Kashyap @ 2010-12-02 20:25 ` Kai Makisara 2010-12-05 10:44 ` Lukas Kolbe 0 siblings, 1 reply; 38+ messages in thread From: Kai Makisara @ 2010-12-02 20:25 UTC (permalink / raw) To: Desai, Kashyap; +Cc: Lukas Kolbe, Boaz Harrosh, linux-scsi On Thu, 2 Dec 2010, Desai, Kashyap wrote: > > > > -----Original Message----- > > From: Kai Makisara [mailto:Kai.Makisara@kolumbus.fi] ... > > You can't convert write of one block into multiple IOs. If someone > > wants > > to write 2 MB blocks, the system must transfer 2 MB in one IO. The > > choices > > are: > > I am not sure why single IO cannot be converted into multiple "IO" request. > If you run below commands > " sg_dd if=/dev/zero of=/dev/sdb bs=4800000 count=1" > > You will see multiple IOs(requests) are coming to low layer driver. > Yes, but each one is writing one or more disk blocks of 512 bytes. You don't see writes of partial blocks. If the block size of the device is 2 MB, the minimum IO request is 2 MB. (The SCSI commands for sequential access devices are different from the commands for block devices, but the minimum read/write unit is one block in both cases.) Kai P.S. Why is it necessary to use 2 MB blocks? Some people say that it is the optimal block size for some current tape drives. ^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: After memory pressure: can't read from tape anymore 2010-12-02 20:25 ` Kai Makisara @ 2010-12-05 10:44 ` Lukas Kolbe 0 siblings, 0 replies; 38+ messages in thread From: Lukas Kolbe @ 2010-12-05 10:44 UTC (permalink / raw) To: Kai Makisara; +Cc: Desai, Kashyap, Boaz Harrosh, linux-scsi Am Donnerstag, den 02.12.2010, 22:25 +0200 schrieb Kai Makisara: > P.S. Why is it necessary to use 2 MB blocks? Some people say that it is > the optimal block size for some current tape drives. I really couldn't care less about blocksizes, but it seems to be that it's impossible to reach high LTO4 write-speeds with lower blocksizes; tests show that >100MiB/s are not possible with blocksizes around 64KB to 512KB (from memory), so: the bigger the blocksize, the higher the writing speed. I suppose this gets even more critical with LTO5 drives. -- Lukas ^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: After memory pressure: can't read from tape anymore 2010-12-02 16:22 ` Kai Makisara 2010-12-02 18:14 ` Desai, Kashyap @ 2010-12-03 10:13 ` FUJITA Tomonori 2010-12-03 10:45 ` Desai, Kashyap 1 sibling, 1 reply; 38+ messages in thread From: FUJITA Tomonori @ 2010-12-03 10:13 UTC (permalink / raw) To: Kai.Makisara; +Cc: Kashyap.Desai, lkolbe, bharrosh, linux-scsi On Thu, 2 Dec 2010 18:22:33 +0200 (EET) Kai Makisara <Kai.Makisara@kolumbus.fi> wrote: > > -#define MPT_SCSI_SG_DEPTH CONFIG_FUSION_MAX_SGE > > +#define MPT_SCSI_SG_DEPTH 256 > > -- > > > > 128 is good amount for Scatter gather element. This value is standard value for MPT FUSIION, since long. > > > > This value will be reflect to sg_tablesize and linux scatter-gather module will use this value for creating sg_table for HBA. > > See: " cat /sys/class/scsi_host/host<x>/sg_tablesize" > > > > If single IO is not able to fit into sg_tablesize, then it will be converted into multiple IOs for Low Layer Drivers(By "scatter-gather" module of linux). > > So I do not see any problem with > > CONFIG_FUSION_MAX_SGE value. Our driver internally convert sglist into SGE which understood by LSI H/W. > > > You can't convert write of one block into multiple IOs. If someone wants > to write 2 MB blocks, the system must transfer 2 MB in one IO. The choices > are: I'm not sure that Kashyap is talking about multiple IO requests. SGE is LSI H/W's data structure to describe the set of a dma address and a transfer length. LSI H/W can chain SGE so if you pass large scatter-gather to the driver, it leads to single SCSI command with chained multiple SGEs. I suppose mpt2sas can handle more than 256 scatter gathers. ^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: After memory pressure: can't read from tape anymore 2010-12-03 10:13 ` FUJITA Tomonori @ 2010-12-03 10:45 ` Desai, Kashyap 2010-12-03 11:11 ` FUJITA Tomonori 0 siblings, 1 reply; 38+ messages in thread From: Desai, Kashyap @ 2010-12-03 10:45 UTC (permalink / raw) To: FUJITA Tomonori, Kai.Makisara; +Cc: lkolbe, bharrosh, linux-scsi > -----Original Message----- > From: FUJITA Tomonori [mailto:fujita.tomonori@lab.ntt.co.jp] > Sent: Friday, December 03, 2010 3:43 PM > To: Kai.Makisara@kolumbus.fi > Cc: Desai, Kashyap; lkolbe@techfak.uni-bielefeld.de; > bharrosh@panasas.com; linux-scsi@vger.kernel.org > Subject: RE: After memory pressure: can't read from tape anymore > > On Thu, 2 Dec 2010 18:22:33 +0200 (EET) > Kai Makisara <Kai.Makisara@kolumbus.fi> wrote: > > > > -#define MPT_SCSI_SG_DEPTH CONFIG_FUSION_MAX_SGE > > > +#define MPT_SCSI_SG_DEPTH 256 > > > -- > > > > > > 128 is good amount for Scatter gather element. This value is > standard value for MPT FUSIION, since long. > > > > > > This value will be reflect to sg_tablesize and linux scatter-gather > module will use this value for creating sg_table for HBA. > > > See: " cat /sys/class/scsi_host/host<x>/sg_tablesize" > > > > > > If single IO is not able to fit into sg_tablesize, then it will be > converted into multiple IOs for Low Layer Drivers(By "scatter-gather" > module of linux). > > > So I do not see any problem with > > > CONFIG_FUSION_MAX_SGE value. Our driver internally convert sglist > into SGE which understood by LSI H/W. > > > > > You can't convert write of one block into multiple IOs. If someone > wants > > to write 2 MB blocks, the system must transfer 2 MB in one IO. The > choices > > are: > > I'm not sure that Kashyap is talking about multiple IO requests. > > SGE is LSI H/W's data structure to describe the set of a dma address > and a transfer length. LSI H/W can chain SGE so if you pass large > scatter-gather to the driver, it leads to single SCSI command with > changed multiple SGEs. 
I suppose mpt2sas can handle more than 256 > scatter gatters. Let me add some more input. I do not see any problem with any parameters in driver is playing role in this issue. As earlier suggested by Kai, they think changing " CONFIG_FUSION_MAX_SGE" will help. But I see maximum value for " CONFIG_FUSION_MAX_SGE" is 128 only and we can not go beyond this. I see scsi_alloc_queue below lines blk_queue_max_hw_segments(q, shost->sg_tablesize); blk_queue_max_phys_segments(q, SCSI_MAX_SG_CHAIN_SEGMENTS); both the values play role while selecting max scatter gather elements.. Where max_hw_segements are equal to value of " CONFIG_FUSION_MAX_SGE", but max_phys_segements are limited to below value #define SCSI_MAX_SG_SEGMENTS 128 (defined in scsi.h) Even for mpt2sas " SCSI_MPT2SAS_MAX_SGE" is 128. ^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: After memory pressure: can't read from tape anymore 2010-12-03 10:45 ` Desai, Kashyap @ 2010-12-03 11:11 ` FUJITA Tomonori 0 siblings, 0 replies; 38+ messages in thread From: FUJITA Tomonori @ 2010-12-03 11:11 UTC (permalink / raw) To: Kashyap.Desai; +Cc: fujita.tomonori, Kai.Makisara, lkolbe, bharrosh, linux-scsi On Fri, 3 Dec 2010 16:15:34 +0530 "Desai, Kashyap" <Kashyap.Desai@lsi.com> wrote: > But I see maximum value for " CONFIG_FUSION_MAX_SGE" is 128 only and we can not go beyond this. > I see scsi_alloc_queue below lines > > blk_queue_max_hw_segments(q, shost->sg_tablesize); > blk_queue_max_phys_segments(q, SCSI_MAX_SG_CHAIN_SEGMENTS); > > both the values play role while selecting max scatter gather elements.. > Where max_hw_segements are equal to value of " CONFIG_FUSION_MAX_SGE", but max_phys_segements are limited to below value > #define SCSI_MAX_SG_SEGMENTS 128 (defined in scsi.h) Hmm, max_phys_segements is set to SCSI_MAX_SG_CHAIN_SEGMENTS (not SCSI_MAX_SG_SEGMENTS). SCSI_MAX_SG_CHAIN_SEGMENTS can be larger: #ifdef ARCH_HAS_SG_CHAIN #define SCSI_MAX_SG_CHAIN_SEGMENTS 2048 #else #define SCSI_MAX_SG_CHAIN_SEGMENTS SCSI_MAX_SG_SEGMENTS #endif So shost->sg_tablesize (SCSI_MPT2SAS_MAX_SGE) sets the limit here. btw, max_hw_segments and max_phys_segments were consolidated. With newer kernels, you can find the following in __scsi_alloc_queue(): blk_queue_max_segments(q, min_t(unsigned short, shost->sg_tablesize, SCSI_MAX_SG_CHAIN_SEGMENTS)); ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-11-30 19:53 ` Kai Makisara 2010-12-01 9:40 ` Lukas Kolbe @ 2010-12-02 10:01 ` Lukas Kolbe 1 sibling, 0 replies; 38+ messages in thread From: Lukas Kolbe @ 2010-12-02 10:01 UTC (permalink / raw) To: Kai Makisara; +Cc: Boaz Harrosh, linux-scsi, Kashyap Desai Am Dienstag, den 30.11.2010, 21:53 +0200 schrieb Kai Makisara: > On Tue, 30 Nov 2010, Boaz Harrosh wrote: > > ... > > I looked at enlarge_buffer() and it looks fragile and broken. If you really > > need a pointer eg: > > STbuffer->b_data = page_address(STbuffer->reserved_pages[0]); > > > If you think it is broken, please fix it. > > > Than way not use vmalloc() for buffers larger then PAGE_SIZE? But better yet > > avoid it by keeping a pages_array or sg-list and operate on an aio type > > operations. > > > vmalloc() is not a solution here. Think about this from the HBA side. Each > s/g segment must be contiguous in the address space the HBA uses. In many > cases this is the physical memory address space. Any solution must make > sure that the HBA can perform the requested data transfer. > > > > Kai > > > > But I understand this is a lot of work on an old driver. Perhaps pre-allocate > > something big at startup. specified by user? > > > This used to be possible at some time and it could be made possible again. > But I don't like this option because it means that the users must > explicitly set the boot parameters. > > And it is difficult for me to believe the modern SAS HBAs only support 128 > s/g segments. I'll go ahead and file a bug about this. Maybe this gets it more attention? At the moment, linux 2.6.32-36 is pretty much useless for tape-based backups for us. What's interesting is that when this happens, one of the two tape drives seems to work, where for the second one the system can't allocate large enough buffers. > Kai -- Lukas ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-11-30 17:24 ` Boaz Harrosh 2010-11-30 19:53 ` Kai Makisara @ 2010-12-03 9:44 ` FUJITA Tomonori 1 sibling, 0 replies; 38+ messages in thread From: FUJITA Tomonori @ 2010-12-03 9:44 UTC (permalink / raw) To: bharrosh; +Cc: Kai.Makisara, lkolbe, linux-scsi On Tue, 30 Nov 2010 19:24:25 +0200 Boaz Harrosh <bharrosh@panasas.com> wrote: > I looked at enlarge_buffer() and it looks fragile and broken. If you really > need a pointer eg: > STbuffer->b_data = page_address(STbuffer->reserved_pages[0]); > > Than way not use vmalloc() for buffers larger then PAGE_SIZE? But better yet As Kai said, this buffer is used for dma so you can't use vmalloc. sg drivers keeps an array for pages. b_data is used for some commands that do small data transfer (< PAGE_SIZE). So the driver exploits the first page for it. > avoid it by keeping a pages_array or sg-list and operate on an aio type > operations. As explained, the driver keeps a pages_array. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-11-30 13:31 ` Lukas Kolbe 2010-11-30 16:10 ` Boaz Harrosh @ 2010-11-30 16:20 ` Kai Makisara 2010-12-01 17:06 ` Lukas Kolbe 1 sibling, 1 reply; 38+ messages in thread From: Kai Makisara @ 2010-11-30 16:20 UTC (permalink / raw) To: Lukas Kolbe; +Cc: linux-scsi On Tue, 30 Nov 2010, Lukas Kolbe wrote: > On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote: > > Hi, > > > > On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E, > > > Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on > > > debian/squeeze), we see reproducible tape read and write failures after > > > the system was under memory pressure: > > > > > > [342567.297152] st0: Can't allocate 2097152 byte tape buffer. > > > [342569.316099] st0: Can't allocate 2097152 byte tape buffer. > > > [342570.805164] st0: Can't allocate 2097152 byte tape buffer. > > > [342571.958331] st0: Can't allocate 2097152 byte tape buffer. > > > [342572.704264] st0: Can't allocate 2097152 byte tape buffer. > > > [342873.737130] st: from_buffer offset overflow. > > > ... > > When this fails, the driver tries to allocate a kernel buffer so that > > there larger than 4 kB physically contiguous segments. Let's assume that > > it can find 128 16 kB segments. In this case the maximum block size is > > 2048 kB. Memory pressure results in memory fragmentation and the driver > > can't find large enough segments and allocation fails. This is what you > > are seeing. > > Reasonable explanation, thanks. What makes me wonder is why it still > fails *after* memory pressure was gone - ie free shows more than 4GiB of > free memory. I had the output of /proc/meminfo at that time but can't > find it anymore :/ > This is because (AFAIK) the kernel does not defragment the memory. There may be contiguous free pages but the memory management data structures don't show these. > > So, one solution is to use 512 kB block size. 
Another one is to try to > > find out if the 128 segment limit is a physical limitation or just a > > choice. In the latter case the mptsas driver could be modified to support > > larger block size even after memory fragmentation. > > Even with 64kb blocksize (dd bs=64k), I was getting I/O errors trying to > access the tape drive. I am now trying to upper the max_sg_segs > parameter to the st module (modinfo says 256 is the default; I'm trying > 1024 now) and see how well this works under memory pressure. > This will not help. The final limit is the minimum of the limit of st and the limit of mptsas. The mptsas limit is 128. This is the limit that should be increased but I don't know if it is possible. If you see error with 64 kB block size, I would like to see any messages associated with these errors. Kai ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-11-30 16:20 ` Kai Makisara @ 2010-12-01 17:06 ` Lukas Kolbe 2010-12-02 16:41 ` Kai Makisara 0 siblings, 1 reply; 38+ messages in thread From: Lukas Kolbe @ 2010-12-01 17:06 UTC (permalink / raw) To: Kai Makisara; +Cc: linux-scsi, Kashyap Desai Am Dienstag, den 30.11.2010, 18:20 +0200 schrieb Kai Makisara: Hi, in reply to your earlier mail: > On Tue, 30 Nov 2010, Lukas Kolbe wrote: > > > On Mon, 2010-11-29 at 19:09 +0200, Kai Makisara wrote: > > > > Hi, > > > > > > On our backup system (2 LTO4 drives/Tandberg library via LSISAS1068E, > > > > Kernel 2.6.36 with the stock Fusion MPT SAS Host driver 3.04.17 on > > > > debian/squeeze), we see reproducible tape read and write failures after > > > > the system was under memory pressure: > > > > > > > > [342567.297152] st0: Can't allocate 2097152 byte tape buffer. > > > > [342569.316099] st0: Can't allocate 2097152 byte tape buffer. > > > > [342570.805164] st0: Can't allocate 2097152 byte tape buffer. > > > > [342571.958331] st0: Can't allocate 2097152 byte tape buffer. > > > > [342572.704264] st0: Can't allocate 2097152 byte tape buffer. > > > > [342873.737130] st: from_buffer offset overflow. > > > > > ... > > > When this fails, the driver tries to allocate a kernel buffer so that > > > there larger than 4 kB physically contiguous segments. Let's assume that > > > it can find 128 16 kB segments. In this case the maximum block size is > > > 2048 kB. Memory pressure results in memory fragmentation and the driver > > > can't find large enough segments and allocation fails. This is what you > > > are seeing. > > > > Reasonable explanation, thanks. What makes me wonder is why it still > > fails *after* memory pressure was gone - ie free shows more than 4GiB of > > free memory. I had the output of /proc/meminfo at that time but can't > > find it anymore :/ > > > This is because (AFAIK) the kernel does not defragment the memory. 
There > may be contiguous free pages but the memory management data structures > don't show these. > > > > So, one solution is to use 512 kB block size. Another one is to try to > > > find out if the 128 segment limit is a physical limitation or just a > > > choice. In the latter case the mptsas driver could be modified to support > > > larger block size even after memory fragmentation. > > > > Even with 64kb blocksize (dd bs=64k), I was getting I/O errors trying to > > access the tape drive. I am now trying to upper the max_sg_segs > > parameter to the st module (modinfo says 256 is the default; I'm trying > > 1024 now) and see how well this works under memory pressure. > > > This will not help. The final limit is the minimum of the limit of st and > the limit of mtpsas. The mptsas limit is 128. This is the limit that > should be increased but I don't know if it is possible. > > If you see error with 64 kB block size, I would like to see any messages > associated with these errors. I have now hit this bug again. Trying to read and write a label from the tape drive in question results in this (via bacula's btape command): *readlabel 01-Dec 17:47 btape JobId 0: Error: block.c:1002 Read error on fd=3 at file:blk 0:0 on device "drv1" (/dev/nst1). ERR=Value too large for defined data type. btape: btape.c:525 Volume has no label. Volume Label: Id : **error**VerNo : 0 VolName : PrevVolName : VolFile : 0 LabelType : Unknown 0 LabelSize : 0 PoolName : MediaType : PoolType : HostName : Date label written: -4712-01-01 at 00:00 *label Enter Volume Name: AAA543 01-Dec 17:47 btape JobId 0: Error: block.c:577 Write error at 0:0 on device "drv1" (/dev/nst1). ERR=Input/output error. 01-Dec 17:48 btape JobId 0: Error: Backspace record at EOT failed. ERR=Input/output error Wrote Volume label for volume "AAA543". dmesg says (as expected): [158529.011206] st1: Can't allocate 2097152 byte tape buffer. [158544.348411] st: append_to_buffer offset overflow. 
[158544.348416] st: append_to_buffer offset overflow. [158544.348418] st: append_to_buffer offset overflow. [158544.348419] st: append_to_buffer offset overflow. Now a dd with 64kb blocksize behaves really strange: root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=1 dd: reading `/dev/nst1': Device or resource busy 0+0 records in 0+0 records out 0 bytes (0 B) copied, 0.118717 s, 0.0 kB/s ok, so some process must be using /dev/nst1, right? root@shepherd:~# lsof |grep st1 nope, nothing. Subsequent dd's, only a few seconds later: root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=1 0+0 records in 0+0 records out 0 bytes (0 B) copied, 4.64747 s, 0.0 kB/s root@shepherd:~# echo $? 0 Jeha right, we successfully read EOF/EOT root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=1 0+0 records in 0+0 records out 0 bytes (0 B) copied, 0.0041229 s, 0.0 kB/s Possibly another EOT? root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=1 dd: reading `/dev/nst1': Input/output error 0+0 records in 0+0 records out 0 bytes (0 B) copied, 0.0128587 s, 0.0 kB/s root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=1 dd: reading `/dev/nst1': Input/output error 0+0 records in 0+0 records out 0 bytes (0 B) copied, 0.000236144 s, 0.0 kB/s root@shepherd:~# echo $? 1 Hm, now an I/O error! now dmesg has this to tell me: [158651.882012] st1: Can't allocate 5085561 byte tape buffer. Trying to write to the tape looks like below, which seems to match your earlier description; ie 64/65k works, 128k blocksize works, 256k blocksize and above don't work anymore. I wasn't able to reproduce not being able to write with a 64k blocksize at the moment. 
root@shepherd:~# lsof |grep st1 root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=65k count=100 100+0 records in 100+0 records out 6656000 bytes (6.7 MB) copied, 2.08872 s, 3.2 MB/s root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=64k count=100 100+0 records in 100+0 records out 6553600 bytes (6.6 MB) copied, 1.71815 s, 3.8 MB/s root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=512k count=100 dd: writing `/dev/nst1': Input/output error 1+0 records in 0+0 records out 0 bytes (0 B) copied, 1.82643 s, 0.0 kB/s root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=256k count=100 dd: writing `/dev/nst1': Input/output error 1+0 records in 0+0 records out 0 bytes (0 B) copied, 1.71959 s, 0.0 kB/s root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=128k count=100 100+0 records in 100+0 records out 13107200 bytes (13 MB) copied, 2.08911 s, 6.3 MB/s root@shepherd:~# dd if=/dev/zero of=/dev/nst1 bs=64k count=100 100+0 records in 100+0 records out 6553600 bytes (6.6 MB) copied, 1.99401 s, 3.3 MB/s root@shepherd:~# mt -f /dev/nst1 rewind root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=100 0+0 records in 0+0 records out 0 bytes (0 B) copied, 0.00889507 s, 0.0 kB/s root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=64k count=10^C root@shepherd:~# lsof |grep st1 root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=512k count=100 dd: reading `/dev/nst1': Device or resource busy 0+0 records in 0+0 records out 0 bytes (0 B) copied, 0.000232968 s, 0.0 kB/s root@shepherd:~# lsof |grep st1 root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=128k count=100 0+100 records in 0+100 records out 6656000 bytes (6.7 MB) copied, 0.314093 s, 21.2 MB/s root@shepherd:~# dd if=/dev/nst1 of=/tmp/x bs=256k count=100 dd: reading `/dev/nst1': Device or resource busy 0+0 records in 0+0 records out 0 bytes (0 B) copied, 0.000367819 s, 0.0 kB/s root@shepherd:~# lsof |grep st1 root@shepherd:~# Swap is used mainly because of overcommiting RAM to two VMs, but that memory is rarely accessed. 
root@shepherd:~# cat /proc/meminfo MemTotal: 8197296 kB MemFree: 72648 kB Buffers: 40496 kB Cached: 1891664 kB SwapCached: 1131684 kB Active: 4258136 kB Inactive: 3452272 kB Active(anon): 4010488 kB Inactive(anon): 1767884 kB Active(file): 247648 kB Inactive(file): 1684388 kB Unevictable: 160 kB Mlocked: 160 kB SwapTotal: 4194300 kB SwapFree: 1398976 kB Dirty: 336920 kB Writeback: 0 kB AnonPages: 4648596 kB Mapped: 4472 kB Shmem: 12 kB Slab: 155140 kB SReclaimable: 109152 kB SUnreclaim: 45988 kB KernelStack: 1448 kB PageTables: 18436 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 8227412 kB Committed_AS: 7884284 kB VmallocTotal: 34359738367 kB VmallocUsed: 59244 kB VmallocChunk: 34359660812 kB HardwareCorrupted: 0 kB HugePages_Total: 64 HugePages_Free: 64 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB DirectMap4k: 7488 kB DirectMap2M: 8380416 kB I hope this somehow helps, Lukas ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-01 17:06 ` Lukas Kolbe @ 2010-12-02 16:41 ` Kai Makisara 2010-12-06 7:59 ` Kai Makisara 0 siblings, 1 reply; 38+ messages in thread From: Kai Makisara @ 2010-12-02 16:41 UTC (permalink / raw) To: Lukas Kolbe; +Cc: linux-scsi, Kashyap Desai On Wed, 1 Dec 2010, Lukas Kolbe wrote: > Am Dienstag, den 30.11.2010, 18:20 +0200 schrieb Kai Makisara: > ... > > If you see error with 64 kB block size, I would like to see any messages > > associated with these errors. > > I have now hit this bug again. Trying to read and write a label from the > tape drive in question results in this (via bacula's btape command): > > *readlabel > 01-Dec 17:47 btape JobId 0: Error: block.c:1002 Read error on fd=3 at > file:blk 0:0 on device "drv1" (/dev/nst1). ERR=Value too large for > defined data type. > btape: btape.c:525 Volume has no label. > > Volume Label: > Id : **error**VerNo : 0 > VolName : > PrevVolName : > VolFile : 0 > LabelType : Unknown 0 > LabelSize : 0 > PoolName : > MediaType : > PoolType : > HostName : > Date label written: -4712-01-01 at 00:00 > *label > Enter Volume Name: AAA543 > 01-Dec 17:47 btape JobId 0: Error: block.c:577 Write error at 0:0 on > device "drv1" (/dev/nst1). ERR=Input/output error. > 01-Dec 17:48 btape JobId 0: Error: Backspace record at EOT failed. > ERR=Input/output error > Wrote Volume label for volume "AAA543". > > dmesg says (as expected): > > [158529.011206] st1: Can't allocate 2097152 byte tape buffer. > [158544.348411] st: append_to_buffer offset overflow. > [158544.348416] st: append_to_buffer offset overflow. > [158544.348418] st: append_to_buffer offset overflow. > [158544.348419] st: append_to_buffer offset overflow. > The messages except the first one are something that should never happen. I think that there is something wrong with returning from enlarge_buffer() when it fails. I will look at this when I have time. ... 
> Trying to write to the tape looks like below, which seems to match your > earlier description; ie 64/65k works, 128k blocksize works, 256k > blocksize and above don't work anymore. I wasn't able to reproduce not > being able to write with a 64k blocksize at the moment. > Assuming that the MPT driver is configured for 128 segments, you should always be able to read and write 512 kB blocks. ... > I hope this somehow helps, > Lukas > These messages are helpful. Thanks, Kai ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-02 16:41 ` Kai Makisara @ 2010-12-06 7:59 ` Kai Makisara 2010-12-06 8:50 ` FUJITA Tomonori 2010-12-06 9:36 ` Lukas Kolbe 0 siblings, 2 replies; 38+ messages in thread From: Kai Makisara @ 2010-12-06 7:59 UTC (permalink / raw) To: Lukas Kolbe; +Cc: linux-scsi, FUJITA Tomonori On Thu, 2 Dec 2010, Kai Makisara wrote: > On Wed, 1 Dec 2010, Lukas Kolbe wrote: > > > Am Dienstag, den 30.11.2010, 18:20 +0200 schrieb Kai Makisara: > > > ... > > > If you see error with 64 kB block size, I would like to see any messages > > > associated with these errors. > > > > I have now hit this bug again. Trying to read and write a label from the > > tape drive in question results in this (via bacula's btape command): > > ... > > [158529.011206] st1: Can't allocate 2097152 byte tape buffer. > > [158544.348411] st: append_to_buffer offset overflow. > > [158544.348416] st: append_to_buffer offset overflow. > > [158544.348418] st: append_to_buffer offset overflow. > > [158544.348419] st: append_to_buffer offset overflow. > > > The messages except the first one are something that should never happen. > I think that there is something wrong with returning from enlarge_buffer() > when it fails. I will look at this when I have time. > OK, today I have had some time (national holiday). I think I have tracked down this problem. The patch at the end should fix it. Basically, normalize_buffer() needs to know the order of the pages in order to properly free the pages and update the buffer size. When allocation failed, the order was not yet stored into the tape buffer definition. This does explain the problem after allocation failed. Why allocation failed is another problem which has been discussed elsewhere. 
Kai -----------------------------------8<------------------------------------ --- linux-2.6.36.1/drivers/scsi/st.c.org 2010-12-05 17:07:04.285226110 +0200 +++ linux-2.6.36.1/drivers/scsi/st.c 2010-12-06 09:46:34.756000154 +0200 @@ -3729,6 +3729,7 @@ static int enlarge_buffer(struct st_buff order < ST_MAX_ORDER && b_size < new_size; order++, b_size *= 2) ; /* empty */ + STbuffer->reserved_page_order = order; } if (max_segs * (PAGE_SIZE << order) < new_size) { if (order == ST_MAX_ORDER) @@ -3755,7 +3756,6 @@ static int enlarge_buffer(struct st_buff segs++; } STbuffer->b_data = page_address(STbuffer->reserved_pages[0]); - STbuffer->reserved_page_order = order; return 1; } ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-06 7:59 ` Kai Makisara @ 2010-12-06 8:50 ` FUJITA Tomonori 2010-12-06 9:36 ` Lukas Kolbe 1 sibling, 0 replies; 38+ messages in thread From: FUJITA Tomonori @ 2010-12-06 8:50 UTC (permalink / raw) To: Kai.Makisara; +Cc: lkolbe, linux-scsi, fujita.tomonori On Mon, 6 Dec 2010 09:59:11 +0200 (EET) Kai Makisara <Kai.Makisara@kolumbus.fi> wrote: > On Thu, 2 Dec 2010, Kai Makisara wrote: > > > On Wed, 1 Dec 2010, Lukas Kolbe wrote: > > > > > Am Dienstag, den 30.11.2010, 18:20 +0200 schrieb Kai Makisara: > > > > > ... > > > > If you see error with 64 kB block size, I would like to see any messages > > > > associated with these errors. > > > > > > I have now hit this bug again. Trying to read and write a label from the > > > tape drive in question results in this (via bacula's btape command): > > > > ... > > > [158529.011206] st1: Can't allocate 2097152 byte tape buffer. > > > [158544.348411] st: append_to_buffer offset overflow. > > > [158544.348416] st: append_to_buffer offset overflow. > > > [158544.348418] st: append_to_buffer offset overflow. > > > [158544.348419] st: append_to_buffer offset overflow. > > > > > The messages except the first one are something that should never happen. > > I think that there is something wrong with returning from enlarge_buffer() > > when it fails. I will look at this when I have time. > > > OK, today I have had some time (national holiday). I think I have tracked > down this problem. The patch at the end should fix it. Basically, > normalize_buffer() needs to know the order of the pages in order to > properly free the pages and update the buffer size. When allocation > failed, the order was not yet stored into the tape buffer definition. This > does explain the problem after allocation failed. Ah, nice catch! I've not tested the patch but looks correct. > Why allocation failed is > another problem which has been discussed elsewhere. Yeah. Thanks a lot! 
* Re: After memory pressure: can't read from tape anymore 2010-12-06 7:59 ` Kai Makisara 2010-12-06 8:50 ` FUJITA Tomonori @ 2010-12-06 9:36 ` Lukas Kolbe 2010-12-06 11:34 ` Bjørn Mork 2010-12-08 14:19 ` Lukas Kolbe 1 sibling, 2 replies; 38+ messages in thread From: Lukas Kolbe @ 2010-12-06 9:36 UTC (permalink / raw) To: Kai Makisara; +Cc: linux-scsi, FUJITA Tomonori On Mon, 2010-12-06 at 09:59 +0200, Kai Makisara wrote: > On Thu, 2 Dec 2010, Kai Makisara wrote: > OK, today I have had some time (national holiday). I think I have tracked > down this problem. The patch at the end should fix it. Basically, > normalize_buffer() needs to know the order of the pages in order to > properly free the pages and update the buffer size. When allocation > failed, the order was not yet stored into the tape buffer definition. This > does explain the problem after allocation failed. Thank you, this does indeed look nasty (like a not easily caught bug). > Why allocation failed is another problem which has been discussed elsewhere. I just tonight discovered that Debian's .config has CONFIG_FUSION_MAX_SGE=40. Right now I'm testing with the mptsas driver's maximum of 128 to see if I can reproduce the failure under memory pressure again. Sorry for not catching this one earlier! After reading the source, I had the assumption that it already was 128 ... -- Lukas ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-06 9:36 ` Lukas Kolbe @ 2010-12-06 11:34 ` Bjørn Mork 2010-12-08 14:19 ` Lukas Kolbe 1 sibling, 0 replies; 38+ messages in thread From: Bjørn Mork @ 2010-12-06 11:34 UTC (permalink / raw) To: linux-scsi Lukas Kolbe <lkolbe@techfak.uni-bielefeld.de> writes: > I just tonight discovered that Debian's .config has > CONFIG_FUSION_MAX_SGE=40. Right now I'm testing with the mptsas' drivers > maximum of 128 and see if I can reproduce the failure under memory > pressure again. Sorry for not catching this one earlier! After reading > the source, I had the assumption that it already was 128 ... So did I... I assume the difference is unintentional so I opened a Debian bug: http://bugs.debian.org/606096 Bjørn -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-06 9:36 ` Lukas Kolbe 2010-12-06 11:34 ` Bjørn Mork @ 2010-12-08 14:19 ` Lukas Kolbe 1 sibling, 0 replies; 38+ messages in thread From: Lukas Kolbe @ 2010-12-08 14:19 UTC (permalink / raw) To: Kai Makisara; +Cc: linux-scsi, FUJITA Tomonori On Mon, 2010-12-06 at 10:36 +0100, Lukas Kolbe wrote: > On Mon, 2010-12-06 at 09:59 +0200, Kai Makisara wrote: > > On Thu, 2 Dec 2010, Kai Makisara wrote: > > > OK, today I have had some time (national holiday). I think I have tracked > > down this problem. The patch at the end should fix it. Basically, > > normalize_buffer() needs to know the order of the pages in order to > > properly free the pages and update the buffer size. When allocation > > failed, the order was not yet stored into the tape buffer definition. This > > does explain the problem after allocation failed. > > Thank you, this does indeed look nasty (like a not easily caught bug). > > > Why allocation failed is another problem which has been discussed elsewhere. > > I just tonight discovered that Debian's .config has > CONFIG_FUSION_MAX_SGE=40. Right now I'm testing with the mptsas driver's > maximum of 128 to see if I can reproduce the failure under memory > pressure again. Sorry for not catching this one earlier! After reading > the source, I had the assumption that it already was 128 ... Debian has since switched to the new default value here, and after a few days of testing with CONFIG_FUSION_MAX_SGE=128 and your patch I wasn't able to reproduce the failure anymore! Kind regards, Lukas ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-11-29 17:09 ` Kai Makisara 2010-11-30 13:31 ` Lukas Kolbe @ 2010-12-03 12:27 ` FUJITA Tomonori 2010-12-03 14:59 ` Kai Mäkisara 1 sibling, 1 reply; 38+ messages in thread From: FUJITA Tomonori @ 2010-12-03 12:27 UTC (permalink / raw) To: Kai.Makisara; +Cc: lkolbe, linux-scsi On Mon, 29 Nov 2010 19:09:46 +0200 (EET) Kai Makisara <Kai.Makisara@kolumbus.fi> wrote: > > This same behaviour appears when we're doing a few incremental backups; > > after a while, it just isn't possible to use the tape drives anymore - > > every I/O operation gives an I/O Error, even a simple dd bs=64k > > count=10. After a restart, the system behaves correctly until > > -seemingly- another memory pressure situation occurred. > > > This is predictable. The maximum number of scatter/gather segments seems > to be 128. The st driver first tries to set up transfer directly from the > user buffer to the HBA. The user buffer is usually fragmented so that one > scatter/gather segment is used for each page. Assuming 4 kB page size, the > maximum size of the direct transfer is 128 x 4 = 512 kB. Can we make enlarge_buffer friendly to the memory allocator a bit? His problem is that the driver can't allocate 2 MB with the hardware limit of 128 segments. enlarge_buffer tries to use ST_MAX_ORDER and if the allocation (256 kB page) fails, enlarge_buffer fails. It could try a smaller order instead? Not tested at all. diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c index 5b7388f..119544b 100644 --- a/drivers/scsi/st.c +++ b/drivers/scsi/st.c @@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm b_size = PAGE_SIZE << order; } else { for (b_size = PAGE_SIZE, order = 0; - order < ST_MAX_ORDER && b_size < new_size; + order < ST_MAX_ORDER && + max_segs * (PAGE_SIZE << order) < new_size; order++, b_size *= 2) ; /* empty */ } ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-03 12:27 ` FUJITA Tomonori @ 2010-12-03 14:59 ` Kai Mäkisara 2010-12-03 15:06 ` James Bottomley 0 siblings, 1 reply; 38+ messages in thread From: Kai Mäkisara @ 2010-12-03 14:59 UTC (permalink / raw) To: FUJITA Tomonori; +Cc: lkolbe, linux-scsi On 12/03/2010 02:27 PM, FUJITA Tomonori wrote: > On Mon, 29 Nov 2010 19:09:46 +0200 (EET) > Kai Makisara <Kai.Makisara@kolumbus.fi> wrote: > >>> This same behaviour appears when we're doing a few incremental backups; >>> after a while, it just isn't possible to use the tape drives anymore - >>> every I/O operation gives an I/O Error, even a simple dd bs=64k >>> count=10. After a restart, the system behaves correctly until >>> -seemingly- another memory pressure situation occurred. >>> >> This is predictable. The maximum number of scatter/gather segments seems >> to be 128. The st driver first tries to set up transfer directly from the >> user buffer to the HBA. The user buffer is usually fragmented so that one >> scatter/gather segment is used for each page. Assuming 4 kB page size, the >> maximum size of the direct transfer is 128 x 4 = 512 kB. > > Can we make enlarge_buffer friendly to the memory allocator a bit? > > His problem is that the driver can't allocate 2 MB with the hardware > limit of 128 segments. > > enlarge_buffer tries to use ST_MAX_ORDER and if the allocation (256 kB > page) fails, enlarge_buffer fails. It could try a smaller order instead? > > Not tested at all. 
> > > diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c > index 5b7388f..119544b 100644 > --- a/drivers/scsi/st.c > +++ b/drivers/scsi/st.c > @@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm > b_size = PAGE_SIZE << order; > } else { > for (b_size = PAGE_SIZE, order = 0; > - order < ST_MAX_ORDER && b_size < new_size; > + order < ST_MAX_ORDER && > + max_segs * (PAGE_SIZE << order) < new_size; > order++, b_size *= 2) > ; /* empty */ > } You are correct. The loop does not work at all as it should. Years ago, the strategy was to start with as big blocks as possible to minimize the number of s/g segments. Nowadays the segments must be of the same size and the old logic is not applicable. I have not tested the patch either but it looks correct. Thanks for noticing this bug. I hope this helps the users. The question about the number of s/g segments is still valid for the direct i/o case but that is optimization and not whether one can read/write. Thanks, Kai ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-03 14:59 ` Kai Mäkisara @ 2010-12-03 15:06 ` James Bottomley 2010-12-03 17:03 ` Lukas Kolbe 0 siblings, 1 reply; 38+ messages in thread From: James Bottomley @ 2010-12-03 15:06 UTC (permalink / raw) To: Kai Mäkisara; +Cc: FUJITA Tomonori, lkolbe, linux-scsi On Fri, 2010-12-03 at 16:59 +0200, Kai Mäkisara wrote: > On 12/03/2010 02:27 PM, FUJITA Tomonori wrote: > > On Mon, 29 Nov 2010 19:09:46 +0200 (EET) > > Kai Makisara<Kai.Makisara@kolumbus.fi> wrote: > > > >>> This same behaviour appears when we're doing a few incremental backups; > >>> after a while, it just isn't possible to use the tape drives anymore - > >>> every I/O operation gives an I/O Error, even a simple dd bs=64k > >>> count=10. After a restart, the system behaves correctly until > >>> -seemingly- another memory pressure situation occured. > >>> > >> This is predictable. The maximum number of scatter/gather segments seems > >> to be 128. The st driver first tries to set up transfer directly from the > >> user buffer to the HBA. The user buffer is usually fragmented so that one > >> scatter/gather segment is used for each page. Assuming 4 kB page size, the > >> maximu size of the direct transfer is 128 x 4 = 512 kB. > > > > Can we make enlarge_buffer friendly to the memory alloctor a bit? > > > > His problem is that the driver can't allocate 2 mB with the hardware > > limit 128 segments. > > > > enlarge_buffer tries to use ST_MAX_ORDER and if the allocation (256 kB > > page) fails, enlarge_buffer fails. It could try smaller order instead? > > > > Not tested at all. 
> > > > > > diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c > > index 5b7388f..119544b 100644 > > --- a/drivers/scsi/st.c > > +++ b/drivers/scsi/st.c > > @@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm > > b_size = PAGE_SIZE<< order; > > } else { > > for (b_size = PAGE_SIZE, order = 0; > > - order< ST_MAX_ORDER&& b_size< new_size; > > + order< ST_MAX_ORDER&& > > + max_segs * (PAGE_SIZE<< order)< new_size; > > order++, b_size *= 2) > > ; /* empty */ > > } > > You are correct. The loop does not work at all as it should. Years ago, > the strategy was to start with as big blocks as possible to minimize the > number s/g segments. Nowadays the segments must be of same size and the > old logic is not applicable. > > I have not tested the patch either but it looks correct. > > Thanks for noticing this bug. I hope this helps the users. The question > about number of s/g segments is still valid for the direct i/o case but > that is optimization and not whether one can read/write. Realistically, though, this will only increase the probability of making an allocation work, we can't get this to a certainty. Since we fixed up the infrastructure to allow arbitrary length sg lists, perhaps we should document what cards can actually take advantage of this (and how to do so, since it's not set automatically on boot). That way users wanting tapes at least know what the problems are likely to be and how to avoid them in their hardware purchasing decisions. The corollary is that we should likely have a list of not recommended cards: if they can't go over 128 SG elements, then they're pretty much unsuitable for modern tapes. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-03 15:06 ` James Bottomley @ 2010-12-03 17:03 ` Lukas Kolbe 2010-12-03 18:10 ` James Bottomley 0 siblings, 1 reply; 38+ messages in thread From: Lukas Kolbe @ 2010-12-03 17:03 UTC (permalink / raw) To: James Bottomley Cc: Kai Mäkisara, FUJITA Tomonori, linux-scsi, Kashyap Desai Am Freitag, den 03.12.2010, 09:06 -0600 schrieb James Bottomley: > On Fri, 2010-12-03 at 16:59 +0200, Kai Mäkisara wrote: > > On 12/03/2010 02:27 PM, FUJITA Tomonori wrote: > > > > > > Can we make enlarge_buffer friendly to the memory alloctor a bit? > > > > > > His problem is that the driver can't allocate 2 mB with the hardware > > > limit 128 segments. > > > > > > enlarge_buffer tries to use ST_MAX_ORDER and if the allocation (256 kB > > > page) fails, enlarge_buffer fails. It could try smaller order instead? > > > > > > Not tested at all. > > > > > > > > > diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c > > > index 5b7388f..119544b 100644 > > > --- a/drivers/scsi/st.c > > > +++ b/drivers/scsi/st.c > > > @@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm > > > b_size = PAGE_SIZE<< order; > > > } else { > > > for (b_size = PAGE_SIZE, order = 0; > > > - order< ST_MAX_ORDER&& b_size< new_size; > > > + order< ST_MAX_ORDER&& > > > + max_segs * (PAGE_SIZE<< order)< new_size; > > > order++, b_size *= 2) > > > ; /* empty */ > > > } > > > > You are correct. The loop does not work at all as it should. Years ago, > > the strategy was to start with as big blocks as possible to minimize the > > number s/g segments. Nowadays the segments must be of same size and the > > old logic is not applicable. > > > > I have not tested the patch either but it looks correct. > > > > Thanks for noticing this bug. I hope this helps the users. The question > > about number of s/g segments is still valid for the direct i/o case but > > that is optimization and not whether one can read/write. 
> > Realistically, though, this will only increase the probability of making > an allocation work, we can't get this to a certainty. > > Since we fixed up the infrastructure to allow arbitrary length sg lists, > perhaps we should document what cards can actually take advantage of > this (and how to do so, since it's not set automatically on boot). That > way users wanting tapes at least know what the problems are likely to be > and how to avoid them in their hardware purchasing decisions. The > corollary is that we should likely have a list of not recommended cards: > if they can't go over 128 SG elements, then they're pretty much > unsuitable for modern tapes. Are you implying here that the LSI SAS1068E is unsuitable to drive two LTO-4 tape drives? Or is it 'just' a problem with the driver? I'll test both the above patch if it helps in our situation and report back. -- Lukas -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-03 17:03 ` Lukas Kolbe @ 2010-12-03 18:10 ` James Bottomley 2010-12-05 10:53 ` Lukas Kolbe 2010-12-14 20:35 ` Vladislav Bolkhovitin 0 siblings, 2 replies; 38+ messages in thread From: James Bottomley @ 2010-12-03 18:10 UTC (permalink / raw) To: Lukas Kolbe; +Cc: Kai Mäkisara, FUJITA Tomonori, linux-scsi, Kashyap Desai On Fri, 2010-12-03 at 18:03 +0100, Lukas Kolbe wrote: > Am Freitag, den 03.12.2010, 09:06 -0600 schrieb James Bottomley: > > On Fri, 2010-12-03 at 16:59 +0200, Kai Mäkisara wrote: > > > On 12/03/2010 02:27 PM, FUJITA Tomonori wrote: > > > > > > > > Can we make enlarge_buffer friendly to the memory alloctor a bit? > > > > > > > > His problem is that the driver can't allocate 2 mB with the hardware > > > > limit 128 segments. > > > > > > > > enlarge_buffer tries to use ST_MAX_ORDER and if the allocation (256 kB > > > > page) fails, enlarge_buffer fails. It could try smaller order instead? > > > > > > > > Not tested at all. > > > > > > > > > > > > diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c > > > > index 5b7388f..119544b 100644 > > > > --- a/drivers/scsi/st.c > > > > +++ b/drivers/scsi/st.c > > > > @@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm > > > > b_size = PAGE_SIZE<< order; > > > > } else { > > > > for (b_size = PAGE_SIZE, order = 0; > > > > - order< ST_MAX_ORDER&& b_size< new_size; > > > > + order< ST_MAX_ORDER&& > > > > + max_segs * (PAGE_SIZE<< order)< new_size; > > > > order++, b_size *= 2) > > > > ; /* empty */ > > > > } > > > > > > You are correct. The loop does not work at all as it should. Years ago, > > > the strategy was to start with as big blocks as possible to minimize the > > > number s/g segments. Nowadays the segments must be of same size and the > > > old logic is not applicable. > > > > > > I have not tested the patch either but it looks correct. > > > > > > Thanks for noticing this bug. 
I hope this helps the users. The question > > > about number of s/g segments is still valid for the direct i/o case but > > > that is optimization and not whether one can read/write. > > > > Realistically, though, this will only increase the probability of making > > an allocation work, we can't get this to a certainty. > > > > Since we fixed up the infrastructure to allow arbitrary length sg lists, > > perhaps we should document what cards can actually take advantage of > > this (and how to do so, since it's not set automatically on boot). That > > way users wanting tapes at least know what the problems are likely to be > > and how to avoid them in their hardware purchasing decisions. The > > corollary is that we should likely have a list of not recommended cards: > > if they can't go over 128 SG elements, then they're pretty much > > unsuitable for modern tapes. > > Are you implying here that the LSI SAS1068E is unsuitable to drive two > LTO-4 tape drives? Or is it 'just' a problem with the driver? The information seems to be the former. There's no way the kernel can guarantee physical contiguity of memory as it operates. We try to defrag, but it's probabilistic, not certain, so if we have to try to find a physically contiguous buffer to copy into for an operation like this, at some point that allocation is going to fail. The only way to be certain you can get a 2MB block down to a tape device is to be able to transmit the whole thing as a SG list of fully discontiguous pages. On a system with 4k pages, that requires 512 SG entries. From what I've heard Kashyap say, that can't currently be done on the 1068 because of firmware limitations (I'm not entirely clear on this, but that's how it sounds to me ... if there is a way of making firmware accept more than 128 SG elements per SCSI command, then it is a fairly simple driver change). This isn't something we can work around in the driver because the transaction can't be split ... 
it has to go down as a single WRITE command with a single output data buffer. The LSI 1068 is an upgradeable firmware system, so it's always possible LSI can come up with a firmware update that increases the size (this would also require a corresponding driver change), but it doesn't sound to be something that can be done in the driver alone. James -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-03 18:10 ` James Bottomley @ 2010-12-05 10:53 ` Lukas Kolbe 2010-12-05 12:16 ` FUJITA Tomonori 2010-12-14 20:35 ` Vladislav Bolkhovitin 1 sibling, 1 reply; 38+ messages in thread From: Lukas Kolbe @ 2010-12-05 10:53 UTC (permalink / raw) To: James Bottomley Cc: Kai Mäkisara, FUJITA Tomonori, linux-scsi, Kashyap Desai Am Freitag, den 03.12.2010, 12:10 -0600 schrieb James Bottomley: > On Fri, 2010-12-03 at 18:03 +0100, Lukas Kolbe wrote: > > Am Freitag, den 03.12.2010, 09:06 -0600 schrieb James Bottomley: > > > On Fri, 2010-12-03 at 16:59 +0200, Kai Mäkisara wrote: > > > > On 12/03/2010 02:27 PM, FUJITA Tomonori wrote: > > > > > > > > > > Can we make enlarge_buffer friendly to the memory alloctor a bit? > > > > > > > > > > His problem is that the driver can't allocate 2 mB with the hardware > > > > > limit 128 segments. > > > > > > > > > > enlarge_buffer tries to use ST_MAX_ORDER and if the allocation (256 kB > > > > > page) fails, enlarge_buffer fails. It could try smaller order instead? > > > > > > > > > > Not tested at all. > > > > > > > > > > > > > > > diff --git a/drivers/scsi/st.c b/drivers/scsi/st.c > > > > > index 5b7388f..119544b 100644 > > > > > --- a/drivers/scsi/st.c > > > > > +++ b/drivers/scsi/st.c > > > > > @@ -3729,7 +3729,8 @@ static int enlarge_buffer(struct st_buffer * STbuffer, int new_size, int need_dm > > > > > b_size = PAGE_SIZE<< order; > > > > > } else { > > > > > for (b_size = PAGE_SIZE, order = 0; > > > > > - order< ST_MAX_ORDER&& b_size< new_size; > > > > > + order< ST_MAX_ORDER&& > > > > > + max_segs * (PAGE_SIZE<< order)< new_size; > > > > > order++, b_size *= 2) > > > > > ; /* empty */ > > > > > } > > > > > > > > You are correct. The loop does not work at all as it should. Years ago, > > > > the strategy was to start with as big blocks as possible to minimize the > > > > number s/g segments. 
Nowadays the segments must be of same size and the > > > > old logic is not applicable. > > > > > > > > I have not tested the patch either but it looks correct. > > > > > > > > Thanks for noticing this bug. I hope this helps the users. The question > > > > about number of s/g segments is still valid for the direct i/o case but > > > > that is optimization and not whether one can read/write. > > > > > > Realistically, though, this will only increase the probability of making > > > an allocation work, we can't get this to a certainty. > > > > > > Since we fixed up the infrastructure to allow arbitrary length sg lists, > > > perhaps we should document what cards can actually take advantage of > > > this (and how to do so, since it's not set automatically on boot). That > > > way users wanting tapes at least know what the problems are likely to be > > > and how to avoid them in their hardware purchasing decisions. The > > > corollary is that we should likely have a list of not recommended cards: > > > if they can't go over 128 SG elements, then they're pretty much > > > unsuitable for modern tapes. > > > > Are you implying here that the LSI SAS1068E is unsuitable to drive two > > LTO-4 tape drives? Or is it 'just' a problem with the driver? > > The information seems to be the former. There's no way the kernel can > guarantee physical contiguity of memory as it operates. We try to > defrag, but it's probabalistic, not certain, so if we have to try to > find a physically contiguous buffer to copy into for an operation like > this, at some point that allocation is going to fail. > > The only way to be certain you can get a 2MB block down to a tape device > is to be able to transmit the whole thing as a SG list of fully > discontiguous pages. On a system with 4k pages, that requires 512 SG > entries. 
> From what I've heard Kashyap say, that can't currently be done > on the 1068 because of firmware limitations (I'm not entirely clear on > this, but that's how it sounds to me ... if there is a way of making > firmware accept more than 128 SG elements per SCSI command, then it is a > fairly simple driver change). Well, 2MB blocksizes actually do work - bacula is reporting a blocksize of ~2MB for each drive while writing to it - only after there was memory pressure and a new tape got inserted, it is *not* possible anymore to write to the tape with these blocksizes, and dmesg tells me one of these every time bacula tries to read from or write to a tape: [101883.958351] st0: Can't allocate 2097152 byte tape buffer. [103901.666608] st0: Can't allocate 10249541 byte tape buffer. No idea why it's trying 10MB, though. I tested with the patch from Fujita, and these messages from before applying the patch: [158544.348411] st: append_to_buffer offset overflow. do not appear anymore. It didn't help with the not-being-able-to-write-after-memory-pressure matter, though. > This isn't something we can work around > in the driver because the transaction can't be split ... it has to go > down as a single WRITE command with a single output data buffer. > > The LSI 1068 is an upgradeable firmware system, so it's always possible > LSI can come up with a firmware update that increases the size (this > would also require a corresponding driver change), but it doesn't sound > to be something that can be done in the driver alone. If only LSI's website were a little clearer about where to find updated firmware and what the latest version is :/. -- Lukas ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-05 10:53 ` Lukas Kolbe @ 2010-12-05 12:16 ` FUJITA Tomonori 0 siblings, 0 replies; 38+ messages in thread From: FUJITA Tomonori @ 2010-12-05 12:16 UTC (permalink / raw) To: lkolbe Cc: James.Bottomley, kai.makisara, fujita.tomonori, linux-scsi, Kashyap.Desai On Sun, 05 Dec 2010 11:53:03 +0100 Lukas Kolbe <lkolbe@techfak.uni-bielefeld.de> wrote: > Well, 2MB blocksizes actually do work - bacula is reporting a blocksize > of ~2MB for each drive while writing to it - only after there was memory > pressure and a new tape got inserted, it is *not* possible anymore to > write to the tape with these blocksizes, and dmesg tells me one of these > every time bacula tries to read from or write to a tape: I don't know how bacula works but I guess that it closes and reopen a st device when you insert a new tape. The driver frees the memory when it closes the device. > [101883.958351] st0: Can't allocate 2097152 byte tape buffer. > [103901.666608] st0: Can't allocate 10249541 byte tape buffer. > > No idea why it's trying 10MB, though. > > I tested with the patch from Fujita, and this messages from before > applying the patch: > > [158544.348411] st: append_to_buffer offset overflow. > > do not appear anymore. > It didn't help on the not-being-able-to-write-after-memory-pressure > matter, though. There is no way to guarantee that we can allocate physically continuous large memory. > > This isn't something we can work around > > in the driver because the transaction can't be split ... it has to go > > down as a single WRITE command with a single output data buffer. > > > > The LSI 1068 is an upgradeable firmware system, so it's always possible > > LSI can come up with a firmware update that increases the size (this > > would also require a corresponding driver change), but it doesn't sound > > to be something that can be done in the driver alone. 
> > If only LSI's website were a little more clear on where to find updated > firmware and what was the latest version :/. I think that Desai said that your hardware can handle more sg entries. You have a better chance to live with memory pressure. Try the following patch with my patch. You need to set FUSION_MAX_SGE to 256 with kernel menuconfig (or whatever you like) and rebuild your kernel. Make sure that the driver supports 256 entries like this: fujita@calla:~$ cat /sys/class/scsi_host/host4/sg_tablesize 256 diff --git a/drivers/message/fusion/Kconfig b/drivers/message/fusion/Kconfig index a34a11d..e70d65e 100644 --- a/drivers/message/fusion/Kconfig +++ b/drivers/message/fusion/Kconfig @@ -61,9 +61,9 @@ config FUSION_SAS LSISAS1078 config FUSION_MAX_SGE - int "Maximum number of scatter gather entries (16 - 128)" + int "Maximum number of scatter gather entries (16 - 256)" default "128" - range 16 128 + range 16 256 help This option allows you to specify the maximum number of scatter- gather entries per I/O. The driver default is 128, which matches ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-03 18:10 ` James Bottomley 2010-12-05 10:53 ` Lukas Kolbe @ 2010-12-14 20:35 ` Vladislav Bolkhovitin 2010-12-14 22:23 ` Stephen Hemminger 1 sibling, 1 reply; 38+ messages in thread From: Vladislav Bolkhovitin @ 2010-12-14 20:35 UTC (permalink / raw) To: James Bottomley Cc: Lukas Kolbe, Kai Mäkisara, FUJITA Tomonori, linux-scsi, Kashyap Desai, netdev James Bottomley, on 12/03/2010 09:10 PM wrote: >>>> Thanks for noticing this bug. I hope this helps the users. The question >>>> about number of s/g segments is still valid for the direct i/o case but >>>> that is optimization and not whether one can read/write. >>> >>> Realistically, though, this will only increase the probability of making >>> an allocation work, we can't get this to a certainty. >>> >>> Since we fixed up the infrastructure to allow arbitrary length sg lists, >>> perhaps we should document what cards can actually take advantage of >>> this (and how to do so, since it's not set automatically on boot). That >>> way users wanting tapes at least know what the problems are likely to be >>> and how to avoid them in their hardware purchasing decisions. The >>> corollary is that we should likely have a list of not recommended cards: >>> if they can't go over 128 SG elements, then they're pretty much >>> unsuitable for modern tapes. >> >> Are you implying here that the LSI SAS1068E is unsuitable to drive two >> LTO-4 tape drives? Or is it 'just' a problem with the driver? > > The information seems to be the former. There's no way the kernel can > guarantee physical contiguity of memory as it operates. We try to > defrag, but it's probabalistic, not certain, so if we have to try to > find a physically contiguous buffer to copy into for an operation like > this, at some point that allocation is going to fail. What is interesting to me in this regard is how networking with 9K jumbo frames manages to work acceptably reliable? 
Jumbo frames are used sufficiently often, including under high memory pressure. I'm not a deep networking guru, but network drivers need to allocate physically contiguous memory for skbs, which means 16K per 9K packet, and hence an order-2 allocation per skb. I guess it works reliably because for networking it is OK to drop an incoming packet and retry the allocation for the next one later. If so, maybe similarly in this case it would be worth not returning an allocation error immediately, but retrying several times at intervals of a few seconds? Tape read/write operations usually have pretty big timeouts, like 60 seconds. In that time it is possible to retry 10 times with 5 seconds between retries. Vlad > The only way to be certain you can get a 2MB block down to a tape device > is to be able to transmit the whole thing as a SG list of fully > discontiguous pages. On a system with 4k pages, that requires 512 SG > entries. From what I've heard Kashyap say, that can't currently be done > on the 1068 because of firmware limitations (I'm not entirely clear on > this, but that's how it sounds to me ... if there is a way of making > firmware accept more than 128 SG elements per SCSI command, then it is a > fairly simple driver change). This isn't something we can work around > in the driver because the transaction can't be split ... it has to go > down as a single WRITE command with a single output data buffer. > > The LSI 1068 is an upgradeable firmware system, so it's always possible > LSI can come up with a firmware update that increases the size (this > would also require a corresponding driver change), but it doesn't sound > to be something that can be done in the driver alone. > > James ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-14 20:35 ` Vladislav Bolkhovitin @ 2010-12-14 22:23 ` Stephen Hemminger 2010-12-15 16:27 ` Vladislav Bolkhovitin 0 siblings, 1 reply; 38+ messages in thread From: Stephen Hemminger @ 2010-12-14 22:23 UTC (permalink / raw) To: Vladislav Bolkhovitin Cc: James Bottomley, Lukas Kolbe, Kai Mäkisara, FUJITA Tomonori, linux-scsi, Kashyap Desai, netdev On Tue, 14 Dec 2010 23:35:37 +0300 Vladislav Bolkhovitin <vst@vlnb.net> wrote: > What is interesting to me in this regard is how networking with 9K jumbo > frames manages to work acceptably reliably? Jumbo frames are used > sufficiently often, including under high memory pressure. > > I'm not a deep networking guru, but network drivers need to allocate > physically contiguous memory for skbs, which means 16K per 9K packet, > and hence an order-2 allocation per skb. Good network drivers support fragmentation: they allocate a small portion for the header and allocate pages for the rest. This requires no higher-order allocation. The networking stack takes fragmented data coming in and does the necessary copy/merging to access contiguous headers. There are still some crap network drivers that require large contiguous allocations. These should not be used with jumbo frames in real environments. -- ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: After memory pressure: can't read from tape anymore 2010-12-14 22:23 ` Stephen Hemminger @ 2010-12-15 16:27 ` Vladislav Bolkhovitin 0 siblings, 0 replies; 38+ messages in thread From: Vladislav Bolkhovitin @ 2010-12-15 16:27 UTC (permalink / raw) To: Stephen Hemminger Cc: James Bottomley, Lukas Kolbe, Kai Mäkisara, FUJITA Tomonori, linux-scsi, Kashyap Desai, netdev Stephen Hemminger, on 12/15/2010 01:23 AM wrote: > On Tue, 14 Dec 2010 23:35:37 +0300 > Vladislav Bolkhovitin <vst@vlnb.net> wrote: > >> What is interesting to me in this regard is how networking with 9K jumbo >> frames manages to work acceptably reliably? Jumbo frames are used >> sufficiently often, including under high memory pressure. >> >> I'm not a deep networking guru, but network drivers need to allocate >> physically contiguous memory for skbs, which means 16K per 9K packet, >> and hence an order-2 allocation per skb. > > Good network drivers support fragmentation: they allocate a small portion > for the header and allocate pages for the rest. This requires no higher-order > allocation. The networking stack takes fragmented data coming > in and does the necessary copy/merging to access contiguous headers. > > There are still some crap network drivers that require large contiguous > allocations. These should not be used with jumbo frames in real > environments. I see. Thanks for clarifying it. Vlad ^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads: [~2010-12-15 16:27 UTC | newest]

Thread overview: 38+ messages

2010-11-28 19:15 After memory pressure: can't read from tape anymore Lukas Kolbe
2010-11-29 17:09 ` Kai Makisara
2010-11-30 13:31 ` Lukas Kolbe
2010-11-30 16:10 ` Boaz Harrosh
2010-11-30 16:23 ` Kai Makisara
2010-11-30 16:44 ` Boaz Harrosh
2010-11-30 17:04 ` Kai Makisara
2010-11-30 17:24 ` Boaz Harrosh
2010-11-30 19:53 ` Kai Makisara
2010-12-01 9:40 ` Lukas Kolbe
2010-12-02 11:17 ` Desai, Kashyap
2010-12-02 16:22 ` Kai Makisara
2010-12-02 18:14 ` Desai, Kashyap
2010-12-02 20:25 ` Kai Makisara
2010-12-05 10:44 ` Lukas Kolbe
2010-12-03 10:13 ` FUJITA Tomonori
2010-12-03 10:45 ` Desai, Kashyap
2010-12-03 11:11 ` FUJITA Tomonori
2010-12-02 10:01 ` Lukas Kolbe
2010-12-03 9:44 ` FUJITA Tomonori
2010-11-30 16:20 ` Kai Makisara
2010-12-01 17:06 ` Lukas Kolbe
2010-12-02 16:41 ` Kai Makisara
2010-12-06 7:59 ` Kai Makisara
2010-12-06 8:50 ` FUJITA Tomonori
2010-12-06 9:36 ` Lukas Kolbe
2010-12-06 11:34 ` Bjørn Mork
2010-12-08 14:19 ` Lukas Kolbe
2010-12-03 12:27 ` FUJITA Tomonori
2010-12-03 14:59 ` Kai Mäkisara
2010-12-03 15:06 ` James Bottomley
2010-12-03 17:03 ` Lukas Kolbe
2010-12-03 18:10 ` James Bottomley
2010-12-05 10:53 ` Lukas Kolbe
2010-12-05 12:16 ` FUJITA Tomonori
2010-12-14 20:35 ` Vladislav Bolkhovitin
2010-12-14 22:23 ` Stephen Hemminger
2010-12-15 16:27 ` Vladislav Bolkhovitin