On 05/17/2016 10:19 AM, Christoph Lameter wrote:
>
> On Mon, 16 May 2016, Doug Ledford wrote:
>>
>> Thanks, this looks good now.  When the other two patches come through
>
> The patch can stand on its own and there has been the expectation
> expressed by Mellanox that they want to see this merged first. Guess this
> is to reduce the amount of rewrite they would have to do if things change.
> Then also the team from Mellanox can directly merge the driver changes
> without my involvement.
>

OK.  There are comments from Jason outstanding, and I found one thing
that I missed in my earlier reviews.

I think we need to refactor how we pull out the stats, or at least
consider doing so.  In particular, look at how many stats the cxgb3
driver fills in:

+	stats->dirname = "iw_stats";
+	stats->name = names;
+
+	stats->value[IPINRECEIVES] = ((u64)m.ipInReceive_hi << 32) + m.ipInReceive_lo;
+	stats->value[IPINHDRERRORS] = ((u64)m.ipInHdrErrors_hi << 32) + m.ipInHdrErrors_lo;
+	stats->value[IPINADDRERRORS] = ((u64)m.ipInAddrErrors_hi << 32) + m.ipInAddrErrors_lo;
+	stats->value[IPINUNKNOWNPROTOS] = ((u64)m.ipInUnknownProtos_hi << 32) + m.ipInUnknownProtos_lo;
+	stats->value[IPINDISCARDS] = ((u64)m.ipInDiscards_hi << 32) + m.ipInDiscards_lo;
+	stats->value[IPINDELIVERS] = ((u64)m.ipInDelivers_hi << 32) + m.ipInDelivers_lo;
+	stats->value[IPOUTREQUESTS] = ((u64)m.ipOutRequests_hi << 32) + m.ipOutRequests_lo;
+	stats->value[IPOUTDISCARDS] = ((u64)m.ipOutDiscards_hi << 32) + m.ipOutDiscards_lo;
+	stats->value[IPOUTNOROUTES] = ((u64)m.ipOutNoRoutes_hi << 32) + m.ipOutNoRoutes_lo;
+	stats->value[IPREASMTIMEOUT] = m.ipReasmTimeout;
+	stats->value[IPREASMREQDS] = m.ipReasmReqds;
+	stats->value[IPREASMOKS] = m.ipReasmOKs;
+	stats->value[IPREASMFAILS] = m.ipReasmFails;
+	stats->value[TCPACTIVEOPENS] = m.tcpActiveOpens;
+	stats->value[TCPPASSIVEOPENS] = m.tcpPassiveOpens;
+	stats->value[TCPATTEMPTFAILS] = m.tcpAttemptFails;
+	stats->value[TCPESTABRESETS] = m.tcpEstabResets;
+	stats->value[TCPCURRESTAB] = m.tcpOutRsts;
+	stats->value[TCPINSEGS] = m.tcpCurrEstab;
+	stats->value[TCPOUTSEGS] = ((u64)m.tcpInSegs_hi << 32) + m.tcpInSegs_lo;
+	stats->value[TCPRETRANSSEGS] = ((u64)m.tcpOutSegs_hi << 32) + m.tcpOutSegs_lo;
+	stats->value[TCPINERRS] = ((u64)m.tcpRetransSeg_hi << 32) + m.tcpRetransSeg_lo,
+	stats->value[TCPOUTRSTS] = ((u64)m.tcpInErrs_hi << 32) + m.tcpInErrs_lo;
+	stats->value[TCPRTOMIN] = m.tcpRtoMin;
+	stats->value[TCPRTOMAX] = m.tcpRtoMax;

That's a lot of copies, and shifts, and everything else.  Then look at
what it does to get them:

	ret = dev->rdev.t3cdev_p->ctl(dev->rdev.t3cdev_p, RDMA_GET_MIB, &m);

I didn't dig too deep, but that looks suspiciously like it might be an
actual mailbox command to the card.  That can be rather expensive.  Then
look at how we get the stats to print them to user space:

+static ssize_t show_protocol_stats(struct ib_device *dev, int index,
+				   u8 port, char *buf)
+{
+	struct rdma_protocol_stats stats = {0};
+	ssize_t ret;
+
+	ret = dev->get_protocol_stats(dev, &stats, port);
+	if (ret)
+		return ret;
+
+	return sprintf(buf, "%llu\n", stats.value[index]);
+}

In a nutshell, we go through the effort of a suspected mailbox command,
then we fill in all of the stats including all of the copies and shifts
and everything else, then we print out precisely one and only one stat
before we throw the rest of them away.  If someone goes into the stats
directory for a card and does cat * or
for i in *; do echo -ne "$i:\t"; cat $i; done, then we will issue 25
mailbox commands, and fill in all 25 stats 25 times, just to print out
one complete set of stats.

For cxgb4 this isn't so bad, it's only got 4 items.  But the longer the
list gets, the worse this is because it makes our efficiency of
operation O(n^2).  Since we can't break out mailbox commands to only
provide part of the data, I think we need to consider using a cached
struct for each device.
If the cached data is less than a certain age on subsequent reads, we
use the cached data.  If it's too old, we discard it and get new data.

--
Doug Ledford
    GPG KeyID: 0E572FDD
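
For illustration only, here is a rough userspace sketch of the
age-based cache described above.  It is not a proposed patch.  All of
the names in it (struct stats_cache, read_stat, fetch_all_stats,
NUM_STATS, CACHE_LIFETIME_MS) are hypothetical, the lifetime value is
arbitrary, and fetch_all_stats() merely stands in for the suspected
mailbox command.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_STATS		25	/* number of counters, as in the cxgb3 case */
#define CACHE_LIFETIME_MS	10	/* hypothetical maximum age of cached data */

/* Hypothetical per-device cache: one full set of counters plus its age. */
struct stats_cache {
	uint64_t value[NUM_STATS];
	uint64_t timestamp_ms;	/* time of the last successful fetch */
	int valid;		/* 0 until the first successful fetch */
};

/* Counts how often the expensive path runs; only here for demonstration. */
static int fetch_count;

/* Stand-in for the expensive device query: fills every counter at once. */
static int fetch_all_stats(uint64_t *value)
{
	int i;

	fetch_count++;
	for (i = 0; i < NUM_STATS; i++)
		value[i] = (uint64_t)fetch_count * 1000 + i;
	return 0;
}

/*
 * Return one counter.  The device is queried only when the cache is
 * empty or has reached CACHE_LIFETIME_MS; otherwise the cached copy
 * is used.
 */
static int read_stat(struct stats_cache *c, int index, uint64_t now_ms,
		     uint64_t *out)
{
	int ret;

	assert(c != NULL && out != NULL);
	if (index < 0 || index >= NUM_STATS)
		return -1;

	if (!c->valid || now_ms - c->timestamp_ms >= CACHE_LIFETIME_MS) {
		ret = fetch_all_stats(c->value);
		if (ret)
			return ret;
		c->timestamp_ms = now_ms;
		c->valid = 1;
	}

	*out = c->value[index];
	return 0;
}
```

With something along these lines, reading all 25 files within one
cache lifetime would cost a single device query instead of 25.  A
kernel version would presumably track age in jiffies and would need
locking around the cached struct; neither is shown here.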