* IB: increase DMA max_segment_size on Mellanox hardware
From: David Dillow @ 2011-01-17  2:09 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Roland Dreier, Eli Cohen

By default, each device is assumed to be able to handle only 64 KB chunks
during DMA. By giving the segment size a larger value, the block layer
will coalesce more S/G entries together for SRP, allowing larger
requests with the same sg_tablesize setting.

This ups the value on Mellanox hardware to 2 GB, though there is no HW
limit.

Signed-off-by: David Dillow <dillowda-1Heg1YXhbW8@public.gmane.org>
--
It could be 3 GB or even (4 GB - 1), since it's an unsigned int, but we
should rarely see a 2 GB request anyway.

I think it is probably safe to do something similar for the qib and
ipath drivers, and perhaps the iWARP cards for iSER.

This patch also reduces the iterations in SRP's mapping code for some
upcoming patches I have.
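
For reference, the 64 KB default comes from the generic DMA parameters --
roughly, from include/linux/dma-mapping.h (paraphrased, not a verbatim
quote):

	static inline unsigned int dma_get_max_seg_size(struct device *dev)
	{
		return dev->dma_parms ? dev->dma_parms->max_segment_size : 65536;
	}

	static inline int dma_set_max_seg_size(struct device *dev,
					       unsigned int size)
	{
		if (dev->dma_parms) {
			dev->dma_parms->max_segment_size = size;
			return 0;
		}
		return -EIO;
	}

IIRC the PCI core points pdev->dev.dma_parms at storage inside struct
pci_dev, so the dma_set_max_seg_size() calls below take effect for these
devices.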

diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 5eee666..deb79d4 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -1043,6 +1043,9 @@ static int __mthca_init_one(struct pci_dev *pdev, int hca_type)
 		}
 	}
 
+	/* not a HW limit, but no way to say infinity */
+	dma_set_max_seg_size(&pdev->dev, 2 * 1024 * 1024 * 1024);
+
 	mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev);
 	if (!mdev) {
 		dev_err(&pdev->dev, "Device struct alloc failed, "
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 782f11d..699852e 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -1109,6 +1109,9 @@ static int __mlx4_init_one(struct pci_dev *pdev, const struct pci_device_id *id)
 		}
 	}
 
+	/* not a HW limit, but no way to say infinity */
+	dma_set_max_seg_size(&pdev->dev, 2 * 1024 * 1024 * 1024);
+
 	priv = kzalloc(sizeof *priv, GFP_KERNEL);
 	if (!priv) {
 		dev_err(&pdev->dev, "Device struct alloc failed, "



* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: Or Gerlitz @ 2011-01-17 11:16 UTC (permalink / raw)
  To: David Dillow; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Eli Cohen

David Dillow wrote:
> By default, each device is assumed to be able to handle only 64 KB chunks during DMA.
> By giving the segment size a larger value, the block layer
> will coalesce more S/G entries together for SRP, allowing larger
> requests with the same sg_tablesize setting.

Hi Dave,

I'd be happy if you could elaborate more or point me to some article which
explains how this setting affects the kernel sg coalescing code, specifically -
when the block layer considers both the sg_tablesize advertised by the SCSI
LLD for a scsi host H and the max_seg_size set for the dma device used by H, does
it use a minimum function?

Looking at the code of block/blk-merge.c :: blk_rq_map_sg, which I thought to be
relevant here, I don't see any reference to sg_tablesize / max_sectors. Since it's
block-layer, non-SCSI-specific code, I guess that for scsi devices there's
some overruling done before/after that code runs?

In iser we want to support up to 512KB IOs, so sg_tablesize is 128 (= 512K >> 12),
which on systems with 4K page size adds up to 512K in total (we also set max_sectors
to 1024, so on systems with 16K or whatever page size, we'll not get > 512K IOs).

When running actual IO tests we do see the block layer sending down up to
512K IOs, even with the current / default 64K max_seg_size, so I wasn't sure
what your patch buys, taking into account my minimum assumption above -- thoughts?

> This ups the value on Mellanox hardware to 2 GB, though there is no HW limit.
> It could be 3 GB or even (4 GB - 1), since it's an unsigned int, but we
> should rarely see a 2 GB request anyway.

I think that typically systems will not issue IOs larger than 1/2/4 MB for a
bunch of reasons, so unless I'm missing something big here, 1/2/(4-1) GB is a thousand
times these quantities...

> This patch also reduces the iterations in SRP's mapping code for some
> upcoming patches I have.

These patches aren't targeted for the 2.6.38 merge window, correct? In that
case, I'd like to give us some time to discuss / better understand the low-level
driver patch and not merge it for this window, agree?

Or.

* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: David Dillow @ 2011-01-18  2:46 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Eli Cohen

Hi Or!

On Mon, 2011-01-17 at 13:16 +0200, Or Gerlitz wrote:
> David Dillow wrote:
> > By default, each device is assumed to be able to handle only 64 KB chunks during DMA.
> > By giving the segment size a larger value, the block layer
> > will coalesce more S/G entries together for SRP, allowing larger
> > requests with the same sg_tablesize setting.
> 
> Hi Dave,
> 
> I'd be happy if you could elaborate more or point me to some article which
> explains how this setting affects the kernel sg coalescing code, specifically -
> when the block layer considers both the sg_tablesize advertised by the SCSI
> LLD for a scsi host H and the max_seg_size set for the dma device used by H, does
> it use a minimum function?
> 
> Looking at the code of block/blk-merge.c :: blk_rq_map_sg, which I thought to be
> relevant here, I don't see any reference to sg_tablesize / max_sectors. Since it's
> block-layer, non-SCSI-specific code, I guess that for scsi devices there's
> some overruling done before/after that code runs?

We're talking about different things --
max_segments(sg_tablesize)/max_sectors are the limits as we're adding
pages to the bio via bio_add_page(). blk_rq_map_sg() uses
max_segment_size as a bound on the largest S/G entry into which it can
coalesce multiple entries from the BIO.

It is considered in the line
	if (sg->length + nbytes > queue_max_segment_size(q))
		goto new_segment;
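
For a bit more context, the surrounding coalescing logic looks roughly like
this (paraphrased from memory of block/blk-merge.c, not a verbatim quote):

	if (bvprv && cluster) {
		if (sg->length + nbytes > queue_max_segment_size(q))
			goto new_segment;
		if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec))
			goto new_segment;
		if (!BIOVEC_SEG_BOUNDARY(q, bvprv, bvec))
			goto new_segment;

		sg->length += nbytes;	/* merge into the current S/G entry */
	} else {
new_segment:
		...			/* advance to a fresh S/G entry */
		sg_set_page(sg, bvec->bv_page, nbytes, bvec->bv_offset);
		nsegs++;
	}

So a larger queue_max_segment_size() directly translates into fewer, larger
S/G entries whenever the pages happen to be physically contiguous.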

max_segment_size for SCSI devices gets set from the DMA parameters of
shost->shost_gendev.parent in __scsi_alloc_queue().
shost->shost_gendev.parent is set to ib_device->dma_device by SRP when
it calls scsi_add_host(), and ib_device->dma_device is set to &pdev->dev by
the underlying hardware drivers.
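
In other words, the chain is roughly this (again paraphrasing, this time
__scsi_alloc_queue() in drivers/scsi/scsi_lib.c):

	struct request_queue *__scsi_alloc_queue(struct Scsi_Host *shost,
						 request_fn_proc *request_fn)
	{
		/* for SRP this is ib_device->dma_device, i.e. &pdev->dev */
		struct device *dev = shost->shost_gendev.parent;
		struct request_queue *q = blk_init_queue(request_fn, NULL);

		...
		blk_queue_max_segments(q, min_t(unsigned short,
						shost->sg_tablesize,
						SCSI_MAX_SG_CHAIN_SEGMENTS));
		blk_queue_max_hw_sectors(q, shost->max_sectors);
		blk_queue_max_segment_size(q, dma_get_max_seg_size(dev));
		...
		return q;
	}

So the value the HCA driver passes to dma_set_max_seg_size() ends up as the
request queue's max_segment_size, which is exactly what blk_rq_map_sg()
checks above.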

> In iser we want to support up to 512KB IOs, so sg_tablesize is 128 (= 512K >> 12),
> which on systems with 4K page size adds up to 512K in total (we also set max_sectors
> to 1024, so on systems with 16K or whatever page size, we'll not get > 512K IOs).

Yes, but without this change, you will get your 512 KB request in 8 S/G
entries minimum, when it could be in one if contiguous. For our systems
where we're trying to get 1 MB or larger IOs over SRP, we get 16 S/G
entries when we could get one, potentially forcing us into using FMRs
and doing additional work when we could just map the single entry
directly.

> > This ups the value on Mellanox hardware to 2 GB, though there is no HW limit.
> > It could be 3 GB or even (4 GB - 1), since it's an unsigned int, but we
> > should rarely see a 2 GB request anyway.
> 
> I think that typically systems will not issue IOs larger than 1/2/4 MB for a
> bunch of reasons, so unless I'm missing something big here, 1/2/(4-1) GB is a thousand
> times these quantities...

The size could be 1, 2, 4, 8 MB or whatever. But if there isn't a HW
limit, then why limit yourself at all? Let the block code merge the
contiguous entries as much as it likes. Not merging entries causes us to
do more work unnecessarily -- either by taking a more heavy-weight path
or by walking more S/G entries, possibly doing your own coalescing of
contiguous S/G entries, etc.

I noticed this when working on SRP, as I didn't want to have to iterate
over contiguous S/G entries to notice when they were larger than an FMR
could map, and should be put in the indirect table directly. I thought
that was something the block layer should be doing for me, and looked
into why it wasn't happening. I also found a lack of coalescing in an
IOMMU driver, but that's a different email. :)

To be sure, this is not going to be a huge efficiency gain, but it does
reduce CPU usage and raises the largest possible IO size with the
default srp_sg_tablesize of 12 from 768 KB to some 24 GB. Not that we'd
likely need a request that large or be able to allocate twelve 2 GB
contiguous memory areas, of course.
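
Spelling out that arithmetic:

	12 S/G entries x 64 KB per segment = 768 KB largest IO today
	12 S/G entries x  2 GB per segment =  24 GB theoretical ceiling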

> > This patch also reduces the iterations in SRP's mapping code for some
> > upcoming patches I have.
> 
> These patches aren't targeted for the 2.6.38 merge window, correct? In that
> case, I'd like to give us some time to discuss / better understand the low-level
> driver patch and not merge it for this window, agree?

Sorry, I wasn't clear -- this wasn't aimed at this merge window.

I hope I've explained things a bit better -- please let me know if not.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: Jason Gunthorpe @ 2011-01-18  5:22 UTC (permalink / raw)
  To: David Dillow
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Eli Cohen

On Mon, Jan 17, 2011 at 09:46:23PM -0500, David Dillow wrote:

> To be sure, this is not going to be a huge efficiency gain, but it does
> reduce CPU usage and raises the largest possible IO size with the
> default srp_sg_tablesize of 12 from 768 KB to some 24 GB. Not that we'd
> likely need that large of a request or be able to allocate 12 2GB
> contiguous memory areas, of course.

AH! So this is why.. I went to some trouble to add huge page support
to some code expecting to get bigger/more consistent SRP IOs and didn't
quite get that result :)

Jason

* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: Or Gerlitz @ 2011-01-18  7:16 UTC (permalink / raw)
  To: David Dillow; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Eli Cohen

David Dillow wrote:
> We're talking about different things -- max_segments(sg_tablesize)/max_sectors are 
> the limits as we're adding pages to the bio via bio_add_page(). blk_rq_map_sg() uses
> max_segment_size as a bound on the largest S/G entry into which it can
> coalesce multiple entries from the BIO.
> It is considered in the line
> 	if (sg->length + nbytes > queue_max_segment_size(q))
> 		goto new_segment;

Dave, thanks for the detailed explanation, I understand it much better now. Just
to make sure: blk_rq_map_sg() is called through the flow of dma_map_sg, correct?

If this is the case, we're talking about decision making done by the block layer
during dma-mapping, which later affects the "IB IOMMU mapping" at the IB driver
(e.g. srp, iser, etc).

>> In iser we want to support up to 512KB IOs, so sg_tablesize is 128 (= 512K >> 12),
>> which on systems with 4K page size adds up to 512K in total (we also set max_sectors
>> to 1024, so on systems with 16K or whatever page size, we'll not get > 512K IOs).

> Yes, but without this change, you will get your 512 KB request in 8 S/G
> entries minimum, when it could be in one if contiguous. For our systems
> where we're trying to get 1 MB or larger IOs over SRP, we get 16 S/G
> entries when we could get one, potentially forcing us into using FMRs
> and doing additional work when we could just map the single entry directly.

Since the block layer did its best to coalesce multiple entries from the BIO
into SG(s), you would need to FMR whenever dma_map_sg returns a value > 1.

As you mention later on, I wonder what the benefit would be from not using FMRs,
as we're talking about large IOs (> 64K, on the assumption that the block layer already
coalesces the BIOs that allow it, i.e. their pages are contiguous), for which I would
expect latency, bandwidth and IOPS not to be affected by not FMRing them. So we're
left with the CPU usage saving; do you have (say) "vmstat 1" snapshots before/after
this patch with the ~same IO tool/load that can help quantify this saving?

Also, when working with direct I/O from user space and/or under a file system,
did you really see many BIOs that can be merged? I was under the impression
that (specifically after the system has been active for some time), in most cases,
I get totally scattered SGs whose pages can't be coalesced at all.

Or.

* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: David Dillow @ 2011-01-18 12:25 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Eli Cohen

On Tue, 2011-01-18 at 09:16 +0200, Or Gerlitz wrote:
> David Dillow wrote:
> > We're talking about different things -- max_segments(sg_tablesize)/max_sectors are 
> > the limits as we're adding pages to the bio via bio_add_page(). blk_rq_map_sg() uses
> > max_segment_size as a bound on the largest S/G entry into which it can
> > coalesce multiple entries from the BIO.
> > It is considered in the line
> > 	if (sg->length + nbytes > queue_max_segment_size(q))
> > 		goto new_segment;
> 
> Dave, thanks for the detailed explanation, I understand it much better now, just
> to make sure, blk_rq_map_sg() is called through the flow of dma_map_sg, correct?

For SCSI commands, it is called in scsi_init_sgtable(), which is called
by scsi_init_io(), called in turn by scsi_setup_blk_pc_cmnd() and
scsi_setup_fs_cmnd(). Those are called by various functions in the sd
driver, notably sd_prep_fn(), which is the block queue prep function
called for each request.
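
So for a normal filesystem read/write the chain is roughly:

	sd_prep_fn()			/* block queue prep_rq_fn for sd */
	  scsi_setup_fs_cmnd()
	    scsi_init_io()
	      scsi_init_sgtable()
	        blk_rq_map_sg(req->q, req, sdb->table.sgl)

i.e. the S/G list is built (and coalesced) while the request is being
prepared, well before the LLD sees the command.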

> If this is the case, we're talking about decision making done by the block layer
> during dma-mapping, which later affects the "IB IOMMU mapping" at the IB driver
> (e.g. srp, iser, etc).

Correct, this is all done before srp_queuecommand() is called, and
before any DMA mapping is done. This is before ib_dma_map_sg() is
called, and reduces the number of entries in the S/G list passed to that
function, but the total data will be the same.

> >> In iser we want to support up to 512KB IOs, so sg_tablesize is 128 (= 512K >> 12),
> >> which on systems with 4K page size adds up to 512K in total (we also set max_sectors
> >> to 1024, so on systems with 16K or whatever page size, we'll not get > 512K IOs).
> 
> > Yes, but without this change, you will get your 512 KB request in 8 S/G
> > entries minimum, when it could be in one if contiguous. For our systems
> > where we're trying to get 1 MB or larger IOs over SRP, we get 16 S/G
> > entries when we could get one, potentially forcing us into using FMRs
> > and doing additional work when we could just map the single entry directly.
> 
> Since the block layer did its best to coalesce multiple entries from the BIO
> into SG(s), you would need to FMR whenever dma_map_sg returns a value > 1

Most likely, but there's no point in using an FMR for an individual S/G
entry larger than the FMR size -- it's just extra work and consumes
unneeded resources. That's not an issue in the current code, but is a
small optimization in my new mapping code, which I hope to post later
today.

> As you mention later on, I wonder what the benefit would be from not using FMRs,
> as we're talking about large IOs (> 64K, on the assumption that the block layer already
> coalesces the BIOs that allow it, i.e. their pages are contiguous), for which I would
> expect latency, bandwidth and IOPS not to be affected by not FMRing them. So we're
> left with the CPU usage saving; do you have (say) "vmstat 1" snapshots before/after
> this patch with the ~same IO tool/load that can help quantify this saving?

No, and I'm pretty sure the savings won't show up in such a coarse
measurement. I'm not sure the CPU overhead is even above the noise
floor, but it would seem to be an obvious savings, however minor. In
addition, we are currently breaking up IOs and requiring FMR even when
the entire region is contiguous. I could piece those back together in
the initiator, but that's simply extra work that is duplicative of the
work done by the block layer, wasting further CPU cycles.

> Also, when working with direct I/O from user space and/or under a file system,
> did you really see many BIOs that can be merged? I was under the impression
> that (specifically after the system has been active for some time), in most cases,
> I get totally scattered SGs whose pages can't be coalesced at all.

I would expect that to be a common case, but there are systems out there
for which this is not an issue. They typically allocate buffers early on, when
they can be contiguous, and just keep reusing those.

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: Jason Gunthorpe @ 2011-01-18 18:01 UTC (permalink / raw)
  To: David Dillow
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Eli Cohen

On Tue, Jan 18, 2011 at 07:25:24AM -0500, David Dillow wrote:

> > Also, when working with direct I/O from user space and/or under a file system,
> > did you really see many BIOs that can be merged? I was under the impression
> > that (specifically after the system has been active for some time), in most cases,
> > I get totally scattered SGs whose pages can't be coalesced at all.
> 
> I would expect that to be a common case, but there are systems out there
> for which this is not an issue. They typically allocate buffers early on, when
> they can be contiguous, and just keep reusing those.

It has become really easy to use hugetlb now: you can just mmap
anonymous memory with MAP_HUGETLB and you get 2M pages. If an app is
using O_DIRECT it seems worth making an attempt to use huge pages.

IIRC databases are already doing this.
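
Something like this, for example (untested sketch; it assumes hugepages have
been reserved via /proc/sys/vm/nr_hugepages, that the C library exposes
MAP_HUGETLB, and that "/dev/sdb" is just a placeholder device):

	#define _GNU_SOURCE		/* O_DIRECT */
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define LEN	(2UL * 1024 * 1024)	/* one 2M huge page */

	int main(void)
	{
		/* anonymous buffer backed by a huge page -> physically contiguous */
		void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);

		if (buf == MAP_FAILED || fd < 0)
			return 1;

		/*
		 * O_DIRECT read into the huge-page buffer; with a large enough
		 * max_seg_size the whole 2M can end up as a single S/G entry.
		 */
		if (pread(fd, buf, LEN, 0) < 0)
			return 1;

		close(fd);
		munmap(buf, LEN);
		return 0;
	}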

Jason

* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: Or Gerlitz @ 2011-01-20 10:14 UTC (permalink / raw)
  To: David Dillow; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Eli Cohen

David Dillow wrote:
> On Tue, 2011-01-18 at 09:16 +0200, Or Gerlitz wrote:
>> David Dillow wrote:

> For SCSI commands, it is called in scsi_init_sgtable(), which is called
> by scsi_init_io(), called in turn by scsi_setup_blk_pc_cmnd() and
> scsi_setup_fs_cmnd(). Those are called by various functions in the sd
> driver, notably sd_prep_fn() which is the block queue prep function
> called for each request

Funny, if I understand this correctly: assuming each scsi_host has a dma device
associated with it, sg_dma_map/unmap could be taken out of the SCSI LLD code
and be done at higher levels; someone would just have to scan every queuecommand
and command-response code flow to clean that out, anyway.

> Most likely, but there's no point in using an FMR for an individual S/G
> entry larger than the FMR size -- it's just extra work and consumes
> unneeded resources. That's not an issue in the current code, but is a
> small optimization in my new mapping code, which I hope to post later today.

I'm still trying to understand the bigger picture with your patch set and what
role the mlx4/mthca patch has in it. Here you mention a "small optimization", but
the srp patchset has up to 50% improvement in some flows, so what's the actual
contribution of this patch? E.g. which patch(es) in the srp patch series take
advantage of this fix?

Or.

* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: Or Gerlitz @ 2011-01-20 10:18 UTC (permalink / raw)
  To: David Dillow; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

David Dillow wrote:
> The size could be 1, 2, 4, 8 MB or whatever. But if there isn't a HW
> limit, then why limit yourself at all? 

There are other limits in the ecosystem this patch lives in, e.g. some
buffer allocation and usage scheme with IB RC done by the target... An IB
RC message is limited by the spec to 2 GB and with mlx4 to 1 GB; why not remain
in the (say) 8 MB area with this patch as well?

Or.

* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: David Dillow @ 2011-01-20 12:15 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Eli Cohen

On Thu, 2011-01-20 at 12:14 +0200, Or Gerlitz wrote:
> David Dillow wrote:
> > On Tue, 2011-01-18 at 09:16 +0200, Or Gerlitz wrote:
> >> David Dillow wrote:
> 
> > For SCSI commands, it is called in scsi_init_sgtable(), which is called
> > by scsi_init_io(), called in turn by scsi_setup_blk_pc_cmnd() and
> > scsi_setup_fs_cmnd(). Those are called by various functions in the sd
> > driver, notably sd_prep_fn() which is the block queue prep function
> > called for each request
> 
> Funny, if I understand this correctly: assuming each scsi_host has a dma device
> associated with it, sg_dma_map/unmap could be taken out of the SCSI LLD code
> and be done at higher levels; someone would just have to scan every queuecommand
> and command-response code flow to clean that out, anyway.

Perhaps, except for the need for different devices to call different map
commands, such as ib_dma_*() and sbus_dma_*() etc. There may be other
reasons as well.

> > Most likely, but there's no point in using an FMR for an individual S/G
> > entry larger than the FMR size -- it's just extra work and consumes
> > unneeded resources. That's not an issue in the current code, but is a
> > small optimization in my new mapping code, which I hope to post later today.
> 
> I'm still trying to understand the bigger picture with your patch set and what
> role the mlx4/mthca patch has in it. Here you mention a "small optimization", but
> the srp patchset has up to 50% improvement in some flows, so what's the actual
> contribution of this patch? E.g. which patch(es) in the srp patch series take
> advantage of this fix?

This patch is orthogonal to the SRP mapping changes and was not used for
the performance numbers listed -- the 50% improvement was achieved with
_no_ coalescing in the SG lists. Each 4 KB page was its own SG entry.
That means that in the case of contiguous memory, the SRP code walked 16
times more entries than it would currently, and this change would have
reduced it to examining one SG entry.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: David Dillow @ 2011-01-20 12:25 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, 2011-01-20 at 12:18 +0200, Or Gerlitz wrote:
> David Dillow wrote:
> > The size could be 1, 2, 4, 8 MB or whatever. But if there isn't a HW
> > limit, then why limit yourself at all? 
> 
> There are other limits in the ecosystem this patch lives in, e.g. some
> buffer allocation and usage scheme with IB RC done by the target...

Keep in mind this is not directly seen by the target -- this is purely
an optimization for the block stack. The target has to deal with > 1 MB
requests from an initiator anyway, or it is buggy -- we can generate
those today, just not reliably. And we have the proper tools to deal
with that -- it's called max_hw_sectors_kb, as set by the max_sect
option to SRP.

> An IB RC message is limited by the spec to 2 GB and with mlx4 to 1 GB; why not remain
> in the (say) 8 MB area with this patch as well?

I would call that a HW/spec limit, and would be fine if mlx4 set it to 1
GB. Or even 512 MB or 256 MB. 8 MB would work, but to give room to grow
I'd rather see it higher.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: Or Gerlitz @ 2011-01-24 15:18 UTC (permalink / raw)
  To: David Dillow
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Eli Cohen

David Dillow wrote:
>> Funny, if I understand this correctly: assuming each scsi_host has a dma device
>> associated with it, sg_dma_map/unmap could be taken out of the SCSI LLD code
>> and be done at higher levels; someone would just have to scan every queuecommand
>> and command-response code flow to clean that out, anyway.
> 
> Perhaps, except for the need for different devices to call different map
> commands, such as ib_dma_*() and sbus_dma_*() etc. There may be other
> reasons as well.

Yes, I tend to think that there could be other reasons; otherwise, if this (calling dma mapping from the SCSI LLD queuecommand code) is just something which was needed in the 2.4 days but not any more, the day will come when someone will pick up the gauntlet and make this cleanup.

BTW - ib_dma_map_xxx and friends are considered by me a bug and not a feature. It was pushed by the ipath/qib guys, and I held a minority opinion that we need to stick to dma_map_xxx and have those drivers implement their own software IOMMU emulation. The reason my opinion wasn't accepted is that they're supported only on 64-bit platforms, on which (that was the claim back then) each page has a kernel virtual address associated with it. Oh well.
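
(For context, the wrappers are thin -- roughly, from include/rdma/ib_verbs.h:

	static inline int ib_dma_map_sg(struct ib_device *dev,
					struct scatterlist *sg, int nents,
					enum dma_data_direction direction)
	{
		if (dev->dma_ops)
			return dev->dma_ops->map_sg(dev, sg, nents, direction);
		return dma_map_sg(dev->dma_device, sg, nents, direction);
	}

i.e. they fall through to the regular DMA API unless the driver installs its
own dma_ops, which is what ipath/qib do.)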

>> I'm still trying to understand the bigger picture with your patch set and what
>> role the mlx4/mthca patch has in it

> This patch is orthogonal to the SRP mapping changes and was not used for
> the performance numbers listed -- the 50% improvement was achieved with
> _no_ coalescing in the SG lists. Each 4 KB page was its own SG entry.

I see, thanks for clarifying this loud and clear. So maybe you'll first get the seven srp patches reviewed and merged, and only then see if/what benefit this patch brings. I'm a little bit worried about changing something under everyone's (srp, iser, rds, p9, nfs-rdma, lustre, etc) feet when it doesn't have any notable benefit; I would be happy to hear others' opinions.

Or.

* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: David Dillow @ 2011-01-24 19:36 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Eli Cohen

On Mon, 2011-01-24 at 10:18 -0500, Or Gerlitz wrote:
> David Dillow wrote:
> > Perhaps, except for the need for different devices to call different map
> > commands, such as ib_dma_*() and sbus_dma_*() etc. There may be other
> > reasons as well.

> BTW - ib_dma_map_xxx and friends are considered by me a bug and not a
> feature,

I'm not a fan of them either, but I haven't gone looking to see if I
could find a better solution.

> >> I'm still trying to understand the bigger picture with your patch set and what
> >> role the mlx4/mthca patch has in it
> 
> > This patch is orthogonal to the SRP mapping changes and was not used for
> > the performance numbers listed -- the 50% improvement was achieved with
> > _no_ coalescing in the SG lists. Each 4 KB page was its own SG entry.
> 
> I see, thanks for clarifying this loud and clear. So maybe you'll
> first get the seven srp patches reviewed and merged, and only then see
> if/what benefit this patch brings. I'm a little bit worried about
> changing something under everyone's (srp, iser, rds, p9, nfs-rdma,
> lustre, etc) feet when it doesn't have any notable benefit; I would be
> happy to hear others' opinions.

Well, I believe it only affects the block merging -- I have to catch a
flight, so I cannot recheck right this moment -- that limits the effects
to SRP and iSER. The others don't use it at all.

We know SRP is fine, and a quick glance suggests that iSER is as well --
you limit the maximum IO size, so you don't get overly large request
lists anyway. If you were to expand that, the SCSI midlayer provides
hooks that could be used to clamp down the DMA size for devices hanging
off iSER. SRP cannot do that, as while it is safe to go to a smaller
value than the hardware supports, it is not safe to try and increase it
from the default.
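
A minimal sketch of the kind of clamping I mean -- the hook below is
hypothetical, but blk_queue_max_segment_size() on the scsi_device's queue is
the mechanism:

	/* hypothetical slave_configure for a SCSI host hanging off iSER */
	static int iser_example_slave_configure(struct scsi_device *sdev)
	{
		/* clamp segments for this LUN below the HCA-wide default */
		blk_queue_max_segment_size(sdev->request_queue, 64 * 1024);
		return 0;
	}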

I'll try to recheck once I get a few moments someplace quiet.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: David Dillow @ 2011-02-17  1:06 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Eli Cohen

On Mon, 2011-01-24 at 14:36 -0500, David Dillow wrote:
> On Mon, 2011-01-24 at 10:18 -0500, Or Gerlitz wrote:
> > I see, thanks for clarifying this loud and clear. So maybe you'll
> > first get the seven srp patches reviewed and merged, and only then see
> > if/what benefit this patch brings. I'm a little bit worried about
> > changing something under everyone's (srp, iser, rds, p9, nfs-rdma,
> > lustre, etc) feet when it doesn't have any notable benefit; I would be
> > happy to hear others' opinions.
> 
> Well, I believe it only affects the block merging -- I have to catch a
> flight, so I cannot recheck right this moment -- that limits the effects
> to SRP and iSER. The others don't use it at all.

Ok, finally got a chance to look into this a bit further. The block
layer is the only direct user of it, though a few IOMMU drivers
reference it as well for their *_map_sg coalescing code. pci-gart_64 on
x86, and a smattering on sparc, powerpc, and ia64.
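
From memory, the coalescing check in gart_map_sg() is something like the
following -- paraphrased, not verbatim:

	max_seg_size = dma_get_max_seg_size(dev);
	for_each_sg(sg, s, nents, i) {
		...
		if (!iommu_merge || !nextneed || !need || s->offset ||
		    (s->length + seg_size > max_seg_size) ||
		    (ps->offset + ps->length) % PAGE_SIZE) {
			/* flush the run built so far, start a new DMA segment */
			...
		}
		...
	}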

Since other IB protocols could potentially see larger segments with
this, let's check those:

iSER is fine, because you limit your maximum request size to 512 KB, so
we'll never overrun the page vector in struct iser_page_vec (128 entries
currently). It is independent of the DMA segment size, and handles
multi-page segments already.

IPoIB is fine, as it maps each page individually, and doesn't use
ib_dma_map_sg().

RDS appears to do the right thing and has no dependencies on DMA segment
size, but I don't claim to have done a complete audit.

NFSoRDMA and 9p are OK -- they do not use ib_dma_map_sg(), so they
don't care about the coalescing.

Lustre's ko2iblnd does not care about coalescing -- it properly walks
the returned sg list.


I don't see anything that would have a problem with larger DMA segment
sizes; some of them would see small efficiency gains similar to SRP if
more small segments could be coalesced into fewer larger ones.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office



* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: Or Gerlitz @ 2011-03-16 19:17 UTC (permalink / raw)
  To: David Dillow, Roland Dreier; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Eli Cohen

On Thu, Feb 17, 2011 at 3:06 AM, David Dillow <dillowda-1Heg1YXhbW8@public.gmane.org> wrote:
>> I don't see anything that would have a problem with larger DMA segment
>> sizes; some of them would see small efficiency gains similar to SRP if
>> more small segments could be coalesced into fewer larger ones.

>Is there still any concern about that patch possibly being wrong?
>Because I'd be inclined to merge it at this point.

Roland, what's your take on this patch? Also, I didn't have the chance
to test with it; I will be able to do so early next week. So when do you
intend to send the 2nd pull request for this window?

Or.

* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: Or Gerlitz @ 2011-03-22  8:01 UTC (permalink / raw)
  To: David Dillow; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Eli Cohen

David Dillow wrote:
> This ups the value on Mellanox hardware to 2 GB, though there is no HW
> limit.

You've expressed an intention to reduce that value to 1GB due to HW
limitations; it would also let you get rid of the warnings on the
current version:

>  CC [M]  drivers/infiniband/hw/mthca/mthca_main.o
> drivers/infiniband/hw/mthca/mthca_main.c: In function ‘__mthca_init_one’:
> drivers/infiniband/hw/mthca/mthca_main.c:1047: warning: integer overflow in expression
>   LD [M]  drivers/infiniband/hw/mthca/ib_mthca.o
>   CC [M]  drivers/net/mlx4/main.o
> drivers/net/mlx4/main.c: In function ‘__mlx4_init_one’:
> drivers/net/mlx4/main.c:1113: warning: integer overflow in expression


* Re: IB: increase DMA max_segment_size on Mellanox hardware
From: David Dillow @ 2011-03-22 12:51 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Eli Cohen

By default, each device is assumed to be able to handle only 64 KB chunks
during DMA. By giving the segment size a larger value, the block layer
will coalesce more S/G entries together for SRP, allowing larger
requests with the same sg_tablesize setting.

This ups the value on Mellanox hardware to 1 GB, corresponding to
reported firmware limits on mlx4.

Signed-off-by: David Dillow <dillowda-1Heg1YXhbW8@public.gmane.org>
--

On Tue, 2011-03-22 at 10:01 +0200, Or Gerlitz wrote:
> David Dillow wrote:
> > This ups the value on Mellanox hardware to 2 GB, though there is no HW
> > limit.
> 
> You've expressed an intention to reduce that value to 1GB due to HW
> limitations; it would also let you get rid of the warnings on the
> current version:
> 
> >  CC [M]  drivers/infiniband/hw/mthca/mthca_main.o
> > drivers/infiniband/hw/mthca/mthca_main.c: In function ‘__mthca_init_one’:
> > drivers/infiniband/hw/mthca/mthca_main.c:1047: warning: integer overflow in expression

Well, that's embarrassing. I don't recall seeing those when I tested,
but I don't see how the compiler could have not complained... here's an
updated patch -- against Linus's tree as of a few days ago; work is
currently blocking my git access on wireless. :/

diff --git a/drivers/infiniband/hw/mthca/mthca_main.c b/drivers/infiniband/hw/mthca/mthca_main.c
index 8a40cd5..f24b79b 100644
--- a/drivers/infiniband/hw/mthca/mthca_main.c
+++ b/drivers/infiniband/hw/mthca/mthca_main.c
@@ -1043,6 +1043,9 @@ static int __mthca_init_one(struct pci_dev *pdev, int hca_type)
 		}
 	}
 
+	/* We can handle large RDMA requests, so allow larger segments. */
+	dma_set_max_seg_size(&pdev->dev, 1024 * 1024 * 1024);
+
 	mdev = (struct mthca_dev *) ib_alloc_device(sizeof *mdev);
 	if (!mdev) {
 		dev_err(&pdev->dev, "Device struct alloc failed, "
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index 2765a3c..c835011 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -1109,6 +1109,9 @@ static int __mlx4_init_one(struct pci_dev *pdev, const struct pci_device_id *id)
 		}
 	}
 
+	/* Allow large DMA segments, up to the firmware limit of 1 GB */
+	dma_set_max_seg_size(&pdev->dev, 1024 * 1024 * 1024);
+
 	priv = kzalloc(sizeof *priv, GFP_KERNEL);
 	if (!priv) {
 		dev_err(&pdev->dev, "Device struct alloc failed, "

-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office


