From mboxrd@z Thu Jan 1 00:00:00 1970
From: Albert Chu
Subject: [opensm] RFC: new routing options (repost)
Date: Thu, 10 Feb 2011 17:33:34 -0800
Message-ID: <1297388014.18394.302.camel@auk59.llnl.gov>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="=-YgQUXK6nvWhElX+ynxH2"
Return-path:
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
List-Id: linux-rdma@vger.kernel.org

--=-YgQUXK6nvWhElX+ynxH2
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

[This is a repost from Oct 2010 with rebased patches]

We recently got a new cluster and I've been experimenting with some
routing changes to improve the average bandwidth of the cluster. The
changes are attached as patches, with a description of the routing
goals below.

We're using mpiGraph (http://sourceforge.net/projects/mpigraph/) to
measure min, peak, and average send/recv bandwidth across the cluster.
What we found with the original updn routing was an average of around
420 MB/s send bandwidth and 508 MB/s recv bandwidth. The following two
patches were able to get the average send bandwidth up to 1045 MB/s
and recv bandwidth up to 1228 MB/s.

I'm sure this is only round 1 of the patches and I'm looking for
comments. Many areas could be cleaned up w/ some rearchitecture, but I
elected to implement the least invasive implementation first. I'm also
open to name changes on the options.

1) Port Shifting

This is similar to what was done with some of the LMC > 0 code.
Congestion would occur due to "alignment" of routes w/ common traffic
patterns. However, we found that it was also necessary for LMC=0 and
only for used-ports. For example, let's say there are 4 ports (called
A, B, C, D) and we are routing lids 1-9 through them. Suppose only
routing through A, B, and C will reach lids 1-9.
The LFT would normally be:

A: 1 4 7
B: 2 5 8
C: 3 6 9
D:

The Port Shifting option would make this:

A: 1 6 8
B: 2 4 9
C: 3 5 7
D:

This option by itself improved the mpiGraph average send/recv
bandwidth from 420 MB/s and 508 MB/s to 991 MB/s and 1172 MB/s.

2) Remote Guid Sorting

Most core/spine switches we've seen thus far have had line boards
connected to spine boards in a consistent pattern. However, we
recently got some Qlogic switches that connect from line/leaf boards
to spine boards in a (to the casual observer) random pattern. I'm sure
there was a good electrical/board reason for this design, but it does
hurt routing b/c updn doesn't account for this. Here's an output from
iblinkinfo as an example.

Switch 0x00066a00ec0029b8 ibcore1 L123:
   180    1[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     254   19[  ] "ibsw55" ( )
   180    2[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     253   19[  ] "ibsw56" ( )
   180    3[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     258   19[  ] "ibsw57" ( )
   180    4[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     257   19[  ] "ibsw58" ( )
   180    5[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     256   19[  ] "ibsw59" ( )
   180    6[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     255   19[  ] "ibsw60" ( )
   180    7[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     261   19[  ] "ibsw61" ( )
   180    8[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     262   19[  ] "ibsw62" ( )
   180    9[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     260   19[  ] "ibsw63" ( )
   180   10[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     259   19[  ] "ibsw64" ( )
   180   11[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     284   19[  ] "ibsw65" ( )
   180   12[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     285   19[  ] "ibsw66" ( )
   180   13[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>    2227   19[  ] "ibsw67" ( )
   180   14[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     283   19[  ] "ibsw68" ( )
   180   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     267   19[  ] "ibsw69" ( )
   180   16[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     270   19[  ] "ibsw70" ( )
   180   17[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     269   19[  ] "ibsw71" ( )
   180   18[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     268   19[  ] "ibsw72" ( )
   180   19[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     222   17[  ] "ibcore1 S117B" ( )
   180   20[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     209   19[  ] "ibcore1 S211B" ( )
   180   21[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     218   21[  ] "ibcore1 S117A" ( )
   180   22[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     192   23[  ] "ibcore1 S215B" ( )
   180   23[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      85   15[  ] "ibcore1 S209A" ( )
   180   24[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     182   13[  ] "ibcore1 S215A" ( )
   180   25[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     200   11[  ] "ibcore1 S115B" ( )
   180   26[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     129   25[  ] "ibcore1 S209B" ( )
   180   27[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     213   27[  ] "ibcore1 S115A" ( )
   180   28[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     197   29[  ] "ibcore1 S213B" ( )
   180   29[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     178   28[  ] "ibcore1 S111A" ( )
   180   30[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     215    7[  ] "ibcore1 S213A" ( )
   180   31[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     207    5[  ] "ibcore1 S113B" ( )
   180   32[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     212    6[  ] "ibcore1 S211A" ( )
   180   33[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     154   33[  ] "ibcore1 S113A" ( )
   180   34[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     194   35[  ] "ibcore1 S217B" ( )
   180   35[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     191    3[  ] "ibcore1 S111B" ( )
   180   36[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>     219    1[  ] "ibcore1 S217A" ( )

This is a line board that connects up to spine boards (ibcore1 S*
switches) and down to leaf/edge switches (ibsw*). As you can see, the
line board connects to the ports on the edge switches in a consistent
fashion (always port 19), but connects to the spine switches in a (to
the casual observer) random fashion (port 17, 19, 21, 23, 15, ...).

The "remote_guid_sorting" option will slightly tweak routing so that,
instead of finding a port to route through by searching ports 1 to N,
it will (effectively) sort the ports based on remote connected node
guid, then pick a port by searching from lowest guid to highest guid.
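
To make the two options concrete, here is a minimal, self-contained C
sketch of the selection logic described above. It is a simplified
illustration under stated assumptions, not the code in the attached
patches: struct port_path, guid_cmp(), pick_port() and route() are
hypothetical names, every port is assumed to reach every lid, and LMC
handling is left out. With sorting on, the candidate ports are ordered
by remote node guid before the search; with shifting on, the search
starts at an offset derived from the average path count rather than
always at the first candidate.

```c
#include <stdint.h>
#include <stdlib.h>

#define MAX_PORTS 8

/* Hypothetical, trimmed-down record for one candidate output port. */
struct port_path {
	uint8_t port_num;          /* local port number (1-based) */
	uint32_t path_count;       /* lids already routed through this port */
	uint64_t remote_node_guid; /* guid of the node on the other end */
};

/* qsort() comparator: ascending remote node guid. */
static int guid_cmp(const void *x, const void *y)
{
	const struct port_path *a = x;
	const struct port_path *b = y;

	if (a->remote_node_guid < b->remote_node_guid)
		return -1;
	if (a->remote_node_guid > b->remote_node_guid)
		return 1;
	return 0;
}

/* Return the least-loaded candidate; the first minimum in scan order wins.
 * Assumes n > 0. */
static struct port_path *pick_port(struct port_path *p, unsigned n,
				   int sorting, int shifting)
{
	unsigned total = 0, start = 0, best = 0, i;
	uint32_t least = UINT32_MAX;

	if (sorting)
		qsort(p, n, sizeof(*p), guid_cmp);
	for (i = 0; i < n; i++)
		total += p[i].path_count;
	if (shifting)
		start = total / n; /* rotate the starting point of the scan */
	for (i = 0; i < n; i++) {
		unsigned idx = (start + i) % n;

		if (p[idx].path_count < least) {
			least = p[idx].path_count;
			best = idx;
		}
	}
	return &p[best];
}

/* Route lids 1..nlids over nports ports; lft[lid] receives the port number.
 * lft must have room for nlids + 1 entries. */
static void route(const uint64_t *guids, unsigned nports, unsigned nlids,
		  int sorting, int shifting, uint8_t *lft)
{
	struct port_path p[MAX_PORTS];
	unsigned i, lid;

	if (nports == 0 || nports > MAX_PORTS)
		return;
	for (i = 0; i < nports; i++) {
		p[i].port_num = (uint8_t)(i + 1);
		p[i].path_count = 0;
		p[i].remote_node_guid = guids[i];
	}
	for (lid = 1; lid <= nlids; lid++) {
		struct port_path *choice = pick_port(p, nports, sorting, shifting);

		choice->path_count++;
		lft[lid] = choice->port_num;
	}
}
```

Calling route() for lids 1-9 over three ports with only shifting on
gives ports 1, 2, 3 the lids 1 6 8, 2 4 9 and 3 5 7, which matches the
shifted LFT in the port shifting example. With only sorting on and
remote guids 0x30, 0x10, 0x20 behind ports 1-3, the search visits ports
2, 3, 1, so the search order follows the remote nodes rather than the
local port numbers.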
That way the routing calculations across each line/leaf board and
spine switch will be consistent.

This patch (on top of the port_shifting one above) improved the
mpiGraph average send/recv bandwidth from 991 MB/s and 1172 MB/s to
1045 MB/s and 1228 MB/s.

Al

-- 
Albert Chu
chu11-i2BcT+NCU+M@public.gmane.org
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

--=-YgQUXK6nvWhElX+ynxH2
Content-Disposition: attachment; filename=0001-Support-port-shifting.patch
Content-Type: message/rfc822; name=0001-Support-port-shifting.patch

From: Albert L. Chu
Date: Mon, 7 Feb 2011 16:52:41 -0800
Subject: [PATCH] Support port shifting
Message-Id: <1297379237.18394.290.camel-akkeaxHeDKRliZ7u+bvwcg@public.gmane.org>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit

Signed-off-by: Albert L. Chu
---
 include/opensm/osm_subnet.h |    4 ++
 include/opensm/osm_switch.h |    6 ++-
 man/opensm.8.in             |    8 ++++
 opensm/main.c               |    8 ++++
 opensm/osm_dump.c           |    2 +-
 opensm/osm_subnet.c         |    7 +++
 opensm/osm_switch.c         |   98 ++++++++++++++++++++++++++++++++++++++++++-
 opensm/osm_ucast_mgr.c      |    3 +-
 8 files changed, 132 insertions(+), 4 deletions(-)

diff --git a/include/opensm/osm_subnet.h b/include/opensm/osm_subnet.h
index 42ae416..59f877e 100644
--- a/include/opensm/osm_subnet.h
+++ b/include/opensm/osm_subnet.h
@@ -199,6 +199,7 @@ typedef struct osm_subn_opt {
 	char *root_guid_file;
 	char *cn_guid_file;
 	char *io_guid_file;
+	boolean_t port_shifting;
 	uint16_t max_reverse_hops;
 	char *ids_guid_file;
 	char *guid_routing_order_file;
@@ -418,6 +419,9 @@ typedef struct osm_subn_opt {
 *	Name of the file that contains list of I/O node guids that
 *	will be used by fat-tree routing (provided by User)
 *
+* port_shifting
+*	This option will turn on port_shifting in routing.
+* * ids_guid_file * Name of the file that contains list of ids which should be * used by Up/Down algorithm instead of node GUIDs diff --git a/include/opensm/osm_switch.h b/include/opensm/osm_switch.h index f407dd9..8eae119 100644 --- a/include/opensm/osm_switch.h +++ b/include/opensm/osm_switch.h @@ -919,7 +919,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, IN unsigned start_from, IN boolean_t ignore_existing, IN boolean_t routing_for_lmc, - IN boolean_t dor); + IN boolean_t dor, + IN boolean_t port_shifting); /* * PARAMETERS * p_sw @@ -955,6 +956,9 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, * dor * [in] If TRUE, Dimension Order Routing will be done. * +* port_shifting +* [in] If TRUE, port_shifting will be done. +* * RETURN VALUE * Returns the recommended port on which to route this LID. * diff --git a/man/opensm.8.in b/man/opensm.8.in index cd3a24f..db48d52 100644 --- a/man/opensm.8.in +++ b/man/opensm.8.in @@ -25,6 +25,7 @@ opensm \- InfiniBand subnet manager and administration (SM/SA) [\-a | \-\-root_guid_file ] [\-u | \-\-cn_guid_file ] [\-G | \-\-io_guid_file ] +[\-\-port\-shifting] [\-H | \-\-max_reverse_hops ] [\-X | \-\-guid_routing_order_file ] [\-m | \-\-ids_guid_file ] @@ -208,6 +209,13 @@ to the guids provided in the given file (one to a line). I/O nodes are non-CN nodes allowed to use up to max_reverse_hops switches the wrong way around to improve connectivity. .TP +\fB\-\-port\-shifting\fR +This option enables a feature called \fBport shifting\fR. In some +fabrics, particularly cluster environments, routes commonly align and +congest with other routes due to algorithmically unchanging traffic +patterns. This routing option will "shift" routing around in an +attempt to alleviate this problem. +.TP \fB\-H\fR, \fB\-\-max_reverse_hops\fR Set the maximum number of reverse hops an I/O node is allowed to make. A reverse hop is the use of a switch the wrong way around. 
diff --git a/opensm/main.c b/opensm/main.c index 756fe6f..abb32ec 100644 --- a/opensm/main.c +++ b/opensm/main.c @@ -223,6 +223,9 @@ static void show_usage(void) printf("--io_guid_file, -G \n" " Set the I/O nodes for the Fat-Tree routing algorithm\n" " to the guids provided in the given file (one to a line)\n\n"); + printf("--port-shifting\n" + " Attempt to shift port routes around to remove alignment problems\n" + " in routing tables\n\n"); printf("--max_reverse_hops, -H \n" " Set the max number of hops the wrong way around\n" " an I/O node is allowed to do (connectivity for I/O nodes on top swithces)\n\n"); @@ -601,6 +604,7 @@ int main(int argc, char *argv[]) {"root_guid_file", 1, NULL, 'a'}, {"cn_guid_file", 1, NULL, 'u'}, {"io_guid_file", 1, NULL, 'G'}, + {"port-shifting", 0, NULL, 11}, {"max_reverse_hops", 1, NULL, 'H'}, {"ids_guid_file", 1, NULL, 'm'}, {"guid_routing_order_file", 1, NULL, 'X'}, @@ -937,6 +941,10 @@ int main(int argc, char *argv[]) opt.io_guid_file = optarg; printf(" I/O Node Guid File: %s\n", opt.io_guid_file); break; + case 11: + opt.port_shifting = TRUE; + printf(" Port Shifting is on\n"); + break; case 'H': opt.max_reverse_hops = atoi(optarg); printf(" Max Reverse Hops: %d\n", opt.max_reverse_hops); diff --git a/opensm/osm_dump.c b/opensm/osm_dump.c index 535a03f..a1ff168 100644 --- a/opensm/osm_dump.c +++ b/opensm/osm_dump.c @@ -221,7 +221,7 @@ static void dump_ucast_routes(cl_map_item_t * item, FILE * file, void *cxt) /* No LMC Optimization */ best_port = osm_switch_recommend_path(p_sw, p_port, lid_ho, 1, TRUE, - FALSE, dor); + FALSE, dor, FALSE); fprintf(file, "No %u hop path possible via port %u!", best_hops, best_port); } diff --git a/opensm/osm_subnet.c b/opensm/osm_subnet.c index 228418f..c62192c 100644 --- a/opensm/osm_subnet.c +++ b/opensm/osm_subnet.c @@ -347,6 +347,7 @@ static const opt_rec_t opt_tbl[] = { { "root_guid_file", OPT_OFFSET(root_guid_file), opts_parse_charp, NULL, 0 }, { "cn_guid_file", OPT_OFFSET(cn_guid_file), 
opts_parse_charp, NULL, 0 }, { "io_guid_file", OPT_OFFSET(io_guid_file), opts_parse_charp, NULL, 0 }, + { "port_shifting", OPT_OFFSET(port_shifting), opts_parse_boolean, NULL, 1 }, { "max_reverse_hops", OPT_OFFSET(max_reverse_hops), opts_parse_uint16, NULL, 0 }, { "ids_guid_file", OPT_OFFSET(ids_guid_file), opts_parse_charp, NULL, 0 }, { "guid_routing_order_file", OPT_OFFSET(guid_routing_order_file), opts_parse_charp, NULL, 0 }, @@ -740,6 +741,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * p_opt) p_opt->root_guid_file = NULL; p_opt->cn_guid_file = NULL; p_opt->io_guid_file = NULL; + p_opt->port_shifting = FALSE; p_opt->max_reverse_hops = 0; p_opt->ids_guid_file = NULL; p_opt->guid_routing_order_file = NULL; @@ -1440,6 +1442,11 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t * p_opts) p_opts->lash_start_vl); fprintf(out, + "# Port Shifting (use FALSE if unsure)\n" + "port_shifting %s\n\n", + p_opts->port_shifting ? "TRUE" : "FALSE"); + + fprintf(out, "# SA database file name\nsa_db_file %s\n\n", p_opts->sa_db_file ? 
p_opts->sa_db_file : null_str); diff --git a/opensm/osm_switch.c b/opensm/osm_switch.c index 9785a9d..f24d9ea 100644 --- a/opensm/osm_switch.c +++ b/opensm/osm_switch.c @@ -51,6 +51,14 @@ #include #include +struct switch_port_path { + uint8_t port_num; + uint32_t path_count; + int found_sys_guid; + int found_node_guid; + uint32_t forwarded_to; +}; + cl_status_t osm_switch_set_hops(IN osm_switch_t * p_sw, IN uint16_t lid_ho, IN uint8_t port_num, IN uint8_t num_hops) { @@ -217,7 +225,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, IN unsigned start_from, IN boolean_t ignore_existing, IN boolean_t routing_for_lmc, - IN boolean_t dor) + IN boolean_t dor, + IN boolean_t port_shifting) { /* We support an enhanced LMC aware routing mode: @@ -259,6 +268,11 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, osm_node_t *p_rem_node_first = NULL; struct osm_remote_node *p_remote_guid = NULL; struct osm_remote_node null_remote_node = {NULL, 0, 0}; + struct switch_port_path port_paths[IB_NODE_NUM_PORTS_MAX]; + unsigned int port_paths_total_paths = 0; + unsigned int port_paths_count = 0; + int found_sys_guid; + int found_node_guid; CL_ASSERT(lid_ho > 0); @@ -369,6 +383,7 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, check_count = osm_port_prof_path_count_get(&p_sw->p_prof[port_num]); + if (dor) { /* Get the Remote Node */ p_rem_physp = osm_physp_get_remote(p_physp); @@ -412,7 +427,10 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, best_port_other_sys = port_num; least_forwarded_to = 0; } + found_sys_guid = 0; } else { /* same sys found - try node */ + + /* Else is the node guid already used ? 
*/ p_remote_guid = switch_find_node_guid_count(p_sw, p_port->priv, @@ -427,9 +445,27 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, } /* else prior sys and node guid already used */ + if (!p_remote_guid) + found_node_guid = 0; + else + found_node_guid = 1; + found_sys_guid = 1; } /* same sys found */ } + port_paths[port_paths_count].port_num = port_num; + port_paths[port_paths_count].path_count = check_count; + if (routing_for_lmc) { + port_paths[port_paths_count].found_sys_guid = found_sys_guid; + port_paths[port_paths_count].found_node_guid = found_node_guid; + } + if (routing_for_lmc && p_remote_guid) + port_paths[port_paths_count].forwarded_to = p_remote_guid->forwarded_to; + else + port_paths[port_paths_count].forwarded_to = 0; + port_paths_total_paths += check_count; + port_paths_count++; + /* routing for LMC mode */ /* the count is min but also lower then the max subscribed @@ -454,6 +490,66 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, if (port_found == FALSE) return OSM_NO_PATH; + if (port_shifting && port_paths_count) { + /* In the port_paths[] array, we now have all the ports that we + * can route out of. Using some shifting math below, possibly + * select a different one so that lids won't align in LFTs + * + * If lmc > 0, we need to loop through these ports to find the + * least_forwarded_to port, best_port_other_sys, and + * best_port_other_node just like before but through the different + * ordering. 
+ */ + + least_paths = 0xFFFFFFFF; + least_paths_other_sys = 0xFFFFFFFF; + least_paths_other_nodes = 0xFFFFFFFF; + least_forwarded_to = 0xFFFFFFFF; + best_port = 0; + best_port_other_sys = 0; + best_port_other_node = 0; + + for (i = 0; i < port_paths_count; i++) { + unsigned int idx; + + idx = (port_paths_total_paths/port_paths_count + i) % port_paths_count; + + if (routing_for_lmc) { + if (!port_paths[idx].found_sys_guid + && port_paths[idx].path_count < least_paths_other_sys) { + least_paths_other_sys = port_paths[idx].path_count; + best_port_other_sys = port_paths[idx].port_num; + least_forwarded_to = 0; + } + else if (!port_paths[idx].found_node_guid + && port_paths[idx].path_count < least_paths_other_nodes) { + least_paths_other_nodes = port_paths[idx].path_count; + best_port_other_node = port_paths[idx].port_num; + least_forwarded_to = 0; + } + } + + if (port_paths[idx].path_count < least_paths) { + best_port = port_paths[idx].port_num; + least_paths = port_paths[idx].path_count; + if (routing_for_lmc + && (port_paths[idx].found_sys_guid + || port_paths[idx].found_node_guid) + && port_paths[idx].forwarded_to < least_forwarded_to) + least_forwarded_to = port_paths[idx].forwarded_to; + } + else if (routing_for_lmc + && (port_paths[idx].found_sys_guid + || port_paths[idx].found_node_guid) + && port_paths[idx].path_count == least_paths + && port_paths[idx].forwarded_to < least_forwarded_to) { + least_forwarded_to = port_paths[idx].forwarded_to; + best_port = port_paths[idx].port_num; + } + + } + } + /* if we are in enhanced routing mode and the best port is not the local port 0 diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c index 4019589..d32eb60 100644 --- a/opensm/osm_ucast_mgr.c +++ b/opensm/osm_ucast_mgr.c @@ -255,7 +255,8 @@ static void ucast_mgr_process_port(IN osm_ucast_mgr_t * p_mgr, port = osm_switch_recommend_path(p_sw, p_port, lid_ho, start_from, p_mgr->p_subn->ignore_existing_lfts, p_mgr->p_subn->opt.lmc, - p_mgr->is_dor); + 
p_mgr->is_dor, + p_mgr->p_subn->opt.port_shifting); if (port == OSM_NO_PATH) { /* do not try to overwrite the ppro of non existing port ... */ -- 1.5.4.5 --=-YgQUXK6nvWhElX+ynxH2 Content-Disposition: attachment; filename=0002-Support-remote-guid-sorting.patch Content-Type: message/rfc822; name=0002-Support-remote-guid-sorting.patch From: Albert L. Chu Date: Mon, 7 Feb 2011 16:53:39 -0800 Subject: [PATCH] Support remote guid sorting Message-Id: <1297379237.18394.291.camel-akkeaxHeDKRliZ7u+bvwcg@public.gmane.org> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Signed-off-by: Albert L. Chu --- include/opensm/osm_subnet.h | 4 ++++ include/opensm/osm_switch.h | 6 +++++- man/opensm.8.in | 6 ++++++ opensm/main.c | 8 ++++++++ opensm/osm_dump.c | 3 ++- opensm/osm_subnet.c | 7 +++++++ opensm/osm_switch.c | 26 +++++++++++++++++++++++++- opensm/osm_ucast_mgr.c | 3 ++- 8 files changed, 59 insertions(+), 4 deletions(-) diff --git a/include/opensm/osm_subnet.h b/include/opensm/osm_subnet.h index 59f877e..589e96c 100644 --- a/include/opensm/osm_subnet.h +++ b/include/opensm/osm_subnet.h @@ -200,6 +200,7 @@ typedef struct osm_subn_opt { char *cn_guid_file; char *io_guid_file; boolean_t port_shifting; + boolean_t remote_guid_sorting; uint16_t max_reverse_hops; char *ids_guid_file; char *guid_routing_order_file; @@ -422,6 +423,9 @@ typedef struct osm_subn_opt { * port_shifting * This option will turn on port_shifting in routing. * +* remote_guid_sorting +* This option will turn on remote_guid_sorting in routing. 
+* * ids_guid_file * Name of the file that contains list of ids which should be * used by Up/Down algorithm instead of node GUIDs diff --git a/include/opensm/osm_switch.h b/include/opensm/osm_switch.h index 8eae119..aef45cb 100644 --- a/include/opensm/osm_switch.h +++ b/include/opensm/osm_switch.h @@ -920,7 +920,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, IN boolean_t ignore_existing, IN boolean_t routing_for_lmc, IN boolean_t dor, - IN boolean_t port_shifting); + IN boolean_t port_shifting, + IN boolean_t remote_guid_sorting); /* * PARAMETERS * p_sw @@ -959,6 +960,9 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, * port_shifting * [in] If TRUE, port_shifting will be done. * +* remote_guid_sorting +* [in] If TRUE, remote_guid_sorting will be done. +* * RETURN VALUE * Returns the recommended port on which to route this LID. * diff --git a/man/opensm.8.in b/man/opensm.8.in index db48d52..decaee7 100644 --- a/man/opensm.8.in +++ b/man/opensm.8.in @@ -216,6 +216,12 @@ congest with other routes due to algorithmically unchanging traffic patterns. This routing option will "shift" routing around in an attempt to alleviate this problem. .TP +\fB\-\-remote\-guid\-sorting\fR +This option enables a feature called \fBremote guid sorting\fR. In some +fabrics, switches may be cabled in an inconsistent fashion. This option +may alleviate those issues by sorting remote guids before routing, +making remote destinations appear to be ordered consistently. +.TP \fB\-H\fR, \fB\-\-max_reverse_hops\fR Set the maximum number of reverse hops an I/O node is allowed to make. A reverse hop is the use of a switch the wrong way around. 
diff --git a/opensm/main.c b/opensm/main.c index abb32ec..91ae940 100644 --- a/opensm/main.c +++ b/opensm/main.c @@ -226,6 +226,9 @@ static void show_usage(void) printf("--port-shifting\n" " Attempt to shift port routes around to remove alignment problems\n" " in routing tables\n\n"); + printf("--remote-guid-sorting\n" + " Sort ports by remote port guid before routing to alleviate\n" + " problems with inconsistent cabling across a fabric\n\n"); printf("--max_reverse_hops, -H \n" " Set the max number of hops the wrong way around\n" " an I/O node is allowed to do (connectivity for I/O nodes on top swithces)\n\n"); @@ -605,6 +608,7 @@ int main(int argc, char *argv[]) {"cn_guid_file", 1, NULL, 'u'}, {"io_guid_file", 1, NULL, 'G'}, {"port-shifting", 0, NULL, 11}, + {"remote-guid-sorting", 0, NULL, 13}, {"max_reverse_hops", 1, NULL, 'H'}, {"ids_guid_file", 1, NULL, 'm'}, {"guid_routing_order_file", 1, NULL, 'X'}, @@ -945,6 +949,10 @@ int main(int argc, char *argv[]) opt.port_shifting = TRUE; printf(" Port Shifting is on\n"); break; + case 13: + opt.remote_guid_sorting = TRUE; + printf(" Remote Guid Sorting is on\n"); + break; case 'H': opt.max_reverse_hops = atoi(optarg); printf(" Max Reverse Hops: %d\n", opt.max_reverse_hops); diff --git a/opensm/osm_dump.c b/opensm/osm_dump.c index a1ff168..bfe63c3 100644 --- a/opensm/osm_dump.c +++ b/opensm/osm_dump.c @@ -221,7 +221,8 @@ static void dump_ucast_routes(cl_map_item_t * item, FILE * file, void *cxt) /* No LMC Optimization */ best_port = osm_switch_recommend_path(p_sw, p_port, lid_ho, 1, TRUE, - FALSE, dor, FALSE); + FALSE, dor, FALSE, + FALSE); fprintf(file, "No %u hop path possible via port %u!", best_hops, best_port); } diff --git a/opensm/osm_subnet.c b/opensm/osm_subnet.c index c62192c..b2b219f 100644 --- a/opensm/osm_subnet.c +++ b/opensm/osm_subnet.c @@ -348,6 +348,7 @@ static const opt_rec_t opt_tbl[] = { { "cn_guid_file", OPT_OFFSET(cn_guid_file), opts_parse_charp, NULL, 0 }, { "io_guid_file", 
OPT_OFFSET(io_guid_file), opts_parse_charp, NULL, 0 }, { "port_shifting", OPT_OFFSET(port_shifting), opts_parse_boolean, NULL, 1 }, + { "remote_guid_sorting", OPT_OFFSET(remote_guid_sorting), opts_parse_boolean, NULL, 1 }, { "max_reverse_hops", OPT_OFFSET(max_reverse_hops), opts_parse_uint16, NULL, 0 }, { "ids_guid_file", OPT_OFFSET(ids_guid_file), opts_parse_charp, NULL, 0 }, { "guid_routing_order_file", OPT_OFFSET(guid_routing_order_file), opts_parse_charp, NULL, 0 }, @@ -742,6 +743,7 @@ void osm_subn_set_default_opt(IN osm_subn_opt_t * p_opt) p_opt->cn_guid_file = NULL; p_opt->io_guid_file = NULL; p_opt->port_shifting = FALSE; + p_opt->remote_guid_sorting = FALSE; p_opt->max_reverse_hops = 0; p_opt->ids_guid_file = NULL; p_opt->guid_routing_order_file = NULL; @@ -1447,6 +1449,11 @@ int osm_subn_output_conf(FILE *out, IN osm_subn_opt_t * p_opts) p_opts->port_shifting ? "TRUE" : "FALSE"); fprintf(out, + "# Remote Guid Sorting (use FALSE if unsure)\n" + "remote_guid_sorting %s\n\n", + p_opts->remote_guid_sorting ? "TRUE" : "FALSE"); + + fprintf(out, "# SA database file name\nsa_db_file %s\n\n", p_opts->sa_db_file ? 
p_opts->sa_db_file : null_str); diff --git a/opensm/osm_switch.c b/opensm/osm_switch.c index f24d9ea..0aa0137 100644 --- a/opensm/osm_switch.c +++ b/opensm/osm_switch.c @@ -57,6 +57,7 @@ struct switch_port_path { int found_sys_guid; int found_node_guid; uint32_t forwarded_to; + uint64_t remote_node_guid; }; cl_status_t osm_switch_set_hops(IN osm_switch_t * p_sw, IN uint16_t lid_ho, @@ -169,6 +170,19 @@ boolean_t osm_switch_get_lft_block(IN const osm_switch_t * p_sw, return TRUE; } +static int +port_path_guid_cmp(IN const void *x, IN const void *y) +{ + struct switch_port_path *a = (struct switch_port_path *)x; + struct switch_port_path *b = (struct switch_port_path *)y; + + if (a->remote_node_guid < b->remote_node_guid) + return -1; + if (a->remote_node_guid > b->remote_node_guid) + return 1; + return 0; +} + static struct osm_remote_node * switch_find_guid_common(IN const osm_switch_t * p_sw, IN struct osm_remote_guids_count *r, @@ -226,7 +240,8 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, IN boolean_t ignore_existing, IN boolean_t routing_for_lmc, IN boolean_t dor, - IN boolean_t port_shifting) + IN boolean_t port_shifting, + IN boolean_t remote_guid_sorting) { /* We support an enhanced LMC aware routing mode: @@ -428,6 +443,7 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, least_forwarded_to = 0; } found_sys_guid = 0; + found_node_guid = 0; } else { /* same sys found - try node */ @@ -463,6 +479,9 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, port_paths[port_paths_count].forwarded_to = p_remote_guid->forwarded_to; else port_paths[port_paths_count].forwarded_to = 0; + p_rem_physp = osm_physp_get_remote(p_physp); + p_rem_node = osm_physp_get_node_ptr(p_rem_physp); + port_paths[port_paths_count].remote_node_guid = p_rem_node->node_info.node_guid; port_paths_total_paths += check_count; port_paths_count++; @@ -490,6 +509,11 @@ uint8_t osm_switch_recommend_path(IN const osm_switch_t * p_sw, if (port_found 
== FALSE) return OSM_NO_PATH; + if (remote_guid_sorting && port_paths_count) { + qsort(port_paths, port_paths_count, sizeof(struct switch_port_path), + port_path_guid_cmp); + } + if (port_shifting && port_paths_count) { /* In the port_paths[] array, we now have all the ports that we * can route out of. Using some shifting math below, possibly diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c index d32eb60..a8982df 100644 --- a/opensm/osm_ucast_mgr.c +++ b/opensm/osm_ucast_mgr.c @@ -256,7 +256,8 @@ static void ucast_mgr_process_port(IN osm_ucast_mgr_t * p_mgr, p_mgr->p_subn->ignore_existing_lfts, p_mgr->p_subn->opt.lmc, p_mgr->is_dor, - p_mgr->p_subn->opt.port_shifting); + p_mgr->p_subn->opt.port_shifting, + p_mgr->p_subn->opt.remote_guid_sorting); if (port == OSM_NO_PATH) { /* do not try to overwrite the ppro of non existing port ... */ -- 1.5.4.5 --=-YgQUXK6nvWhElX+ynxH2-- -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html