From: Jerin Jacob
Subject: Re: [PATCH 1/3] examples/eventdev_pipeline: added sample app
Date: Thu, 29 Jun 2017 12:47:28 +0530
Message-ID: <20170629071726.GC5893@jerin>
References: <1492768299-84016-1-git-send-email-harry.van.haaren@intel.com>
 <1492768299-84016-2-git-send-email-harry.van.haaren@intel.com>
 <20170510141202.GA8431@jerin>
 <6c53f05d-2dd2-5b83-2eab-bcecd93bea82@intel.com>
 <20170627093506.GB14276@jerin>
 <72d1d150-a119-2b23-642c-484ca658c4b3@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Cc: Harry van Haaren , dev@dpdk.org, Gage Eads , Bruce Richardson
To: "Hunt, David"
In-Reply-To: <72d1d150-a119-2b23-642c-484ca658c4b3@intel.com>
List-Id: DPDK patches and discussions

-----Original Message-----
> Date: Tue, 27 Jun 2017 14:12:20 +0100
> From: "Hunt, David"
> To: Jerin Jacob
> CC: Harry van Haaren , dev@dpdk.org, Gage Eads , Bruce Richardson
> Subject: Re: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added
>  sample app
> User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101
>  Thunderbird/45.8.0
>
> Hi Jerin:
>
>
> On 27/6/2017 10:35 AM, Jerin Jacob wrote:
> > -----Original Message-----
> > > Date: Mon, 26 Jun 2017 15:46:47 +0100
> > > From: "Hunt, David"
> > > To: Jerin Jacob , Harry van Haaren
> > > CC: dev@dpdk.org, Gage Eads , Bruce Richardson
> > > Subject: Re: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added
> > >  sample app
> > > User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101
> > >  Thunderbird/45.8.0
> > >
> > > Hi Jerin,
> >
> > Hi David,
> >
> > Looks like you have sent the old version. The below-mentioned comments
> > are not addressed in v2.
>
> Oops. Glitch in the Matrix. I've just pushed a V3 with the changes.
>
> > > I'm assisting Harry on the sample app, and have just pushed up a V2
> > > patch based on your feedback. I've addressed most of your suggestions;
> > > comments below. There may still be a couple of outstanding questions
> > > that need further discussion.
> >
> > A few general comments:
> > 1) Nikhil/Gage's proposal for an ethdev Rx to eventdev adapter will
> > change the major portion (eventdev setup and producer()) of this
> > application.
> > 2) Producing only one lcore's worth of packets can't really serve as an
> > example eventdev application, as it will perform poorly on a low-end
> > machine. At the least, the application infrastructure should not be the
> > limiting factor.
> >
> > Considering the above points, should we wait for the Rx adapter to be
> > completed first? I would like to show this as a real-world application
> > that uses eventdev.
> >
> > Thoughts?
> >
> > On the same note:
> > Can we finalize the Rx adapter proposal? I can work on a v1 of the patch
> > and the common code if Nikhil or Gage don't have the bandwidth. Let me
> > know.
> >
> > Last follow-up:
> > http://dpdk.org/ml/archives/dev/2017-June/068776.html
>
> I had a quick chat with Harry, and wonder if we'd be as well to merge the
> app as it is now; as the new frameworks become available, the app can be
> updated to make use of them. I feel it would be better to have something
> out there for people to play with than waiting for 17.11.

I agree with your concern.
How about renaming the test and docs to be specific to the SW PMD, and then,
once we fix the known issues with HW eventdev + ethdev (Rx adapter)
integration, renaming the application back to a generic eventdev one?

> Also, if you have the bandwidth to patch the app for your desired use
> cases, that would be a good contribution. I'd only be guessing for some
> of it :)
>
> > > Regards,
> > > Dave
> > >
> > > On 10/5/2017 3:12 PM, Jerin Jacob wrote:
> > > > -----Original Message-----
> > > > > Date: Fri, 21 Apr 2017 10:51:37 +0100
> > > > > From: Harry van Haaren
> > > > > To: dev@dpdk.org
> > > > > CC: jerin.jacob@caviumnetworks.com, Harry van Haaren , Gage Eads ,
> > > > >  Bruce Richardson
> > > > > Subject: [PATCH 1/3] examples/eventdev_pipeline: added sample app
> > > > > X-Mailer: git-send-email 2.7.4
> > > > >
> > > > > This commit adds a sample app for the eventdev library.
> > > > > The app has been tested with DPDK 17.05-rc2, hence this
> > > > > release (or later) is recommended.
> > > > >
> > > > > The sample app showcases a pipeline processing use-case,
> > > > > with event scheduling and processing defined per stage.
> > > > > The application receives traffic as normal, with each
> > > > > packet traversing the pipeline. Once the packet has
> > > > > been processed by each of the pipeline stages, it is
> > > > > transmitted again.
> > > > >
> > > > > The app provides a framework to utilize cores for a single
> > > > > role or multiple roles. Examples of roles are the RX core,
> > > > > TX core, Scheduling core (in the case of the event/sw PMD),
> > > > > and worker cores.
> > > > >
> > > > > Various flags are available to configure numbers of stages,
> > > > > cycles of work at each stage, type of scheduling, number of
> > > > > worker cores, queue depths etc. For a full explanation,
> > > > > please refer to the documentation.
> > > > >
> > > > > Signed-off-by: Gage Eads
> > > > > Signed-off-by: Bruce Richardson
> > > > > Signed-off-by: Harry van Haaren
> > > >
> > > > Thanks for the example application sharing the SW view.
> > > > I could make it run on HW after some tweaking (not optimized though).
> > > >
> > > > [...]
> > > > > +#define MAX_NUM_STAGES 8
> > > > > +#define BATCH_SIZE 16
> > > > > +#define MAX_NUM_CORE 64
> > > >
> > > > How about RTE_MAX_LCORE?
> > >
> > > Core usage in the sample app is held in a uint64_t. Adding arrays would
> > > be possible, but I feel that the extra effort would not give that much
> > > benefit. I've left it as is for the moment, unless you see any strong
> > > requirement to go beyond 64 cores?
> >
> > I think it is OK. Again, with the service core infrastructure this will
> > change.
> >
> > > > > +
> > > > > +static unsigned int active_cores;
> > > > > +static unsigned int num_workers;
> > > > > +static unsigned long num_packets = (1L << 25); /* do ~32M packets */
> > > > > +static unsigned int num_fids = 512;
> > > > > +static unsigned int num_priorities = 1;
> > > >
> > > > Looks like it's not used.
> > >
> > > Yes. Removed.
> > >
> > > > > +static unsigned int num_stages = 1;
> > > > > +static unsigned int worker_cq_depth = 16;
> > > > > +static int queue_type = RTE_EVENT_QUEUE_CFG_ATOMIC_ONLY;
> > > > > +static int16_t next_qid[MAX_NUM_STAGES+1] = {-1};
> > > > > +static int16_t qid[MAX_NUM_STAGES] = {-1};
> > > >
> > > > Moving all fastpath-related variables under a cache-aligned structure
> > > > will help.
> > >
> > > I tried a few different combinations of this, and saw no gains, some
> > > losses. So I will leave it as is for the moment, if that's OK.
> >
> > I think the variables used in the fastpath are better allocated from
> > huge pages using rte_malloc().
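For illustration, a minimal sketch of that suggestion: group the fastpath
globals into one cache-aligned structure and allocate it from hugepage-backed
memory with rte_zmalloc(). The structure name and field selection below are
hypothetical, not taken from the patch:

    #include <rte_common.h>
    #include <rte_malloc.h>

    /* hypothetical grouping of the app's fastpath globals */
    struct fastpath_data {
            volatile int done;
            unsigned int num_fids;
            unsigned int num_stages;
            int worker_cycles;
            int16_t next_qid[MAX_NUM_STAGES + 1];
            int16_t qid[MAX_NUM_STAGES];
    } __rte_cache_aligned;

    static struct fastpath_data *fdata;

    static int
    init_fastpath_data(void)
    {
            /* rte_zmalloc() hands back zeroed, hugepage-backed memory */
            fdata = rte_zmalloc(NULL, sizeof(*fdata), RTE_CACHE_LINE_SIZE);
            return (fdata == NULL) ? -1 : 0;
    }

Workers would then read fdata->next_qid[] and friends, keeping the hot data
in a few known cache lines.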
> > > > > +static int worker_cycles;
> > > > > +static int enable_queue_priorities;
> > > > > +
> > > > > +struct prod_data {
> > > > > +        uint8_t dev_id;
> > > > > +        uint8_t port_id;
> > > > > +        int32_t qid;
> > > > > +        unsigned num_nic_ports;
> > > > > +};
> > >
> > > Yes, I saw a percent or two gain when this plus the following two data
> > > structs were cache aligned.
> >
> > Looks like it's not fixed in v2. Looks like you have sent the old
> > version.
> >
> > > > > +
> > > > > +        return 0;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +producer(void)
> > > > > +{
> > > > > +        static uint8_t eth_port;
> > > > > +        struct rte_mbuf *mbufs[BATCH_SIZE];
> > > > > +        struct rte_event ev[BATCH_SIZE];
> > > > > +        uint32_t i, num_ports = prod_data.num_nic_ports;
> > > > > +        int32_t qid = prod_data.qid;
> > > > > +        uint8_t dev_id = prod_data.dev_id;
> > > > > +        uint8_t port_id = prod_data.port_id;
> > > > > +        uint32_t prio_idx = 0;
> > > > > +
> > > > > +        const uint16_t nb_rx = rte_eth_rx_burst(eth_port, 0, mbufs,
> > > > > +                        BATCH_SIZE);
> > > > > +        if (++eth_port == num_ports)
> > > > > +                eth_port = 0;
> > > > > +        if (nb_rx == 0) {
> > > > > +                rte_pause();
> > > > > +                return 0;
> > > > > +        }
> > > > > +
> > > > > +        for (i = 0; i < nb_rx; i++) {
> > > > > +                ev[i].flow_id = mbufs[i]->hash.rss;
> > > >
> > > > Prefetching mbufs[i+1] may help here?
> > >
> > > I tried; it didn't make much difference.
> >
> > OK.
> >
> > > > > +                ev[i].op = RTE_EVENT_OP_NEW;
> > > > > +                ev[i].sched_type = queue_type;
> > > >
> > > > The value of RTE_EVENT_QUEUE_CFG_ORDERED_ONLY != RTE_SCHED_TYPE_ORDERED,
> > > > so we cannot assign .sched_type as queue_type.
> > > >
> > > > I think one option to avoid the translation in the application is to:
> > > > - Remove RTE_EVENT_QUEUE_CFG_ALL_TYPES and RTE_EVENT_QUEUE_CFG_*_ONLY
> > > > - Introduce a new RTE_EVENT_DEV_CAP_ flag to denote the
> > > >   RTE_EVENT_QUEUE_CFG_ALL_TYPES capability
> > > > - Add sched_type to struct rte_event_queue_conf. If the capability
> > > >   flag is not set, then the implementation takes the sched_type value
> > > >   for the queue.
> > > >
> > > > Any thoughts?
> > >
> > > Not sure here; would it be OK for the moment, and we can work on a
> > > patch in the future?
> >
> > OK.
> >
> > > > > +
> > > > > +        if (tx_core[lcore_id] && (tx_single ||
> > > > > +            rte_atomic32_cmpset(&tx_lock, 0, 1))) {
> > > > > +                consumer();
> > > >
> > > > Does consumer() need to come in this pattern? I am thinking that if
> > > > the event is from the last stage, then worker() calls consumer().
> > > >
> > > > I think the above scheme works better when the _same_ worker code
> > > > needs to run in both of the cases where
> > > > 1) the ethdev HW is capable of enqueuing packets to the same txq from
> > > >    multiple threads, and
> > > > 2) the ethdev is not capable of doing so.
> > > >
> > > > Both cases can then be addressed at configuration time, where we link
> > > > the queues to ports:
> > > > case 1) link all workers to the last queue
> > > > case 2) link only one worker to the last queue
> > > > and keep the worker code common.
> > > >
> > > > The HW implementation has functional and performance issues if "two"
> > > > ports are assigned to one lcore for dequeue. The above scheme fixes
> > > > that problem too.
> > >
> > > Can we have a bit more discussion on this item? Is this needed for this
> > > sample app, or can we perhaps work on a patch for this later? Harry?
> >
> > As explained above, is there any issue in keeping consumer() for the
> > last stage?
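To illustrate the scheme, the same worker code would run in both cases; only
the port links differ at setup time. A rough sketch, where tx_queue_id,
tx_port_id and mt_safe_txq are hypothetical names:

    #include <rte_eventdev.h>

    static int
    setup_tx_links(uint8_t dev_id, int mt_safe_txq)
    {
            uint8_t last_q = tx_queue_id;   /* queue feeding the TX stage */
            uint8_t prio = RTE_EVENT_DEV_PRIORITY_NORMAL;
            unsigned int i;

            if (mt_safe_txq) {
                    /* case 1: any worker may transmit - link all of them */
                    for (i = 0; i < num_workers; i++)
                            if (rte_event_port_link(dev_id,
                                            worker_data[i].port_id,
                                            &last_q, &prio, 1) != 1)
                                    return -1;
                    return 0;
            }
            /* case 2: txq is not MT safe - link only the dedicated TX port */
            return rte_event_port_link(dev_id, tx_port_id,
                            &last_q, &prio, 1) == 1 ? 0 : -1;
    }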
>
> I would probably see this as a future enhancement as per my initial
> comments above. Any hardware or new framework additions are welcome as
> future patches to the app.
>
> > > > > +                rte_atomic32_clear((rte_atomic32_t *)&tx_lock);
> > > > > +        }
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +worker(void *arg)
> > > > > +{
> > > > > +        struct rte_event events[BATCH_SIZE];
> > > > > +
> > > > > +        struct worker_data *data = (struct worker_data *)arg;
> > > > > +        uint8_t dev_id = data->dev_id;
> > > > > +        uint8_t port_id = data->port_id;
> > > > > +        size_t sent = 0, received = 0;
> > > > > +        unsigned lcore_id = rte_lcore_id();
> > > > > +
> > > > > +        while (!done) {
> > > > > +                uint16_t i;
> > > > > +
> > > > > +                schedule_devices(dev_id, lcore_id);
> > > > > +
> > > > > +                if (!worker_core[lcore_id]) {
> > > > > +                        rte_pause();
> > > > > +                        continue;
> > > > > +                }
> > > > > +
> > > > > +                uint16_t nb_rx = rte_event_dequeue_burst(dev_id,
> > > > > +                                port_id, events, RTE_DIM(events), 0);
> > > > > +
> > > > > +                if (nb_rx == 0) {
> > > > > +                        rte_pause();
> > > > > +                        continue;
> > > > > +                }
> > > > > +                received += nb_rx;
> > > > > +
> > > > > +                for (i = 0; i < nb_rx; i++) {
> > > > > +                        struct ether_hdr *eth;
> > > > > +                        struct ether_addr addr;
> > > > > +                        struct rte_mbuf *m = events[i].mbuf;
> > > > > +
> > > > > +                        /* The first worker stage does classification */
> > > > > +                        if (events[i].queue_id == qid[0])
> > > > > +                                events[i].flow_id = m->hash.rss % num_fids;
> > > >
> > > > Not sure why we need to do this (shrinking the flows) in worker() in
> > > > a queue-based pipeline. If a PMD has any specific requirement on
> > > > num_fids, I think we can move this to the configuration stage, or the
> > > > PMD can choose an optimum fid internally to avoid the modulus
> > > > operation tax in the fastpath in all PMDs.
> > > >
> > > > Does struct rte_event_queue_conf.nb_atomic_flows help here?
> > >
> > > In my tests the modulus makes very little difference in the throughput.
> > > And I think it's good to have a way of varying the number of flows for
> > > testing different scenarios, even if it's not the most performant.
> >
> > Not sure.
> >
> > > > > +
> > > > > +                        events[i].queue_id =
> > > > > +                                next_qid[events[i].queue_id];
> > > > > +                        events[i].op = RTE_EVENT_OP_FORWARD;
> > > >
> > > > Missing events[i].sched_type; the HW PMD does not work without it.
> > > > I think we can use a scheme similar to next_qid for a next_sched_type.
> > >
> > > Done. Added events[i].sched_type = queue_type.
> > >
> > > > > +
> > > > > +                        /* change mac addresses on packet (to use mbuf data) */
> > > > > +                        eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
> > > > > +                        ether_addr_copy(&eth->d_addr, &addr);
> > > > > +                        ether_addr_copy(&eth->s_addr, &eth->d_addr);
> > > > > +                        ether_addr_copy(&addr, &eth->s_addr);
> > > >
> > > > IMO, we can make the packet processing code a "static inline
> > > > function" so different worker types can reuse it.
> > >
> > > Done. Moved out to a work() function.
> >
> > I think the MAC swap should be done only in the last stage, not on each
> > forward; i.e. with the existing code, a two-stage forward leaves the
> > addresses back in their original order.
> >
> > > > > +
> > > > > +                        /* do a number of cycles of work per packet */
> > > > > +                        volatile uint64_t start_tsc = rte_rdtsc();
> > > > > +                        while (rte_rdtsc() < start_tsc + worker_cycles)
> > > > > +                                rte_pause();
> > > >
> > > > Ditto.
> > >
> > > Done. Moved out to a work() function.
> >
> > I think all worker-specific variables like "worker_cycles" can be moved
> > into one structure and used from there.
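For reference, a sketch of the kind of shared work() helper being discussed
(illustrative only; the v3 code may differ). Per the comment above, the MAC
swap could additionally be guarded so it only runs in the last stage:

    #include <rte_cycles.h>
    #include <rte_ether.h>
    #include <rte_mbuf.h>

    static inline void
    work(struct rte_mbuf *m)
    {
            struct ether_hdr *eth;
            struct ether_addr addr;

            /* swap src/dst MAC addresses to touch the mbuf data */
            eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
            ether_addr_copy(&eth->d_addr, &addr);
            ether_addr_copy(&eth->s_addr, &eth->d_addr);
            ether_addr_copy(&addr, &eth->s_addr);

            /* burn a configurable number of cycles per packet;
             * worker_cycles is the app's existing global */
            if (worker_cycles > 0) {
                    uint64_t start_tsc = rte_rdtsc();
                    while (rte_rdtsc() < start_tsc + worker_cycles)
                            rte_pause();
            }
    }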
> > > > > +                }
> > > > > +                uint16_t nb_tx = rte_event_enqueue_burst(dev_id,
> > > > > +                                port_id, events, nb_rx);
> > > > > +                while (nb_tx < nb_rx && !done)
> > > > > +                        nb_tx += rte_event_enqueue_burst(dev_id,
> > > > > +                                        port_id, events + nb_tx,
> > > > > +                                        nb_rx - nb_tx);
> > > > > +                sent += nb_tx;
> > > > > +        }
> > > > > +
> > > > > +        if (!quiet)
> > > > > +                printf("  worker %u thread done. RX=%zu TX=%zu\n",
> > > > > +                                rte_lcore_id(), received, sent);
> > > > > +
> > > > > +        return 0;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Parse the coremask given as argument (hexadecimal string) and fill
> > > > > + * the global configuration (core role and core count) with the parsed
> > > > > + * value.
> > > > > + */
> > > > > +static int xdigit2val(unsigned char c)
> > > >
> > > > There are multiple instances of "xdigit2val" in the DPDK repo. Maybe
> > > > we can push this as common code.
> > >
> > > Sure, that's something we can look at in a separate patch, now that
> > > it's being used more and more.
> >
> > Makes sense.
> >
> > > > > +{
> > > > > +        int val;
> > > > > +
> > > > > +        if (isdigit(c))
> > > > > +                val = c - '0';
> > > > > +        else if (isupper(c))
> > > > > +                val = c - 'A' + 10;
> > > > > +        else
> > > > > +                val = c - 'a' + 10;
> > > > > +        return val;
> > > > > +}
> > > > > +
> > > > > +
> > > > > +static void
> > > > > +usage(void)
> > > > > +{
> > > > > +        const char *usage_str =
> > > > > +                "  Usage: eventdev_demo [options]\n"
> > > > > +                "  Options:\n"
> > > > > +                "  -n, --packets=N             Send N packets (default ~32M), 0 implies no limit\n"
> > > > > +                "  -f, --atomic-flows=N        Use N random flows from 1 to N (default 16)\n"
> > > >
> > > > I think this parameter now affects the application fastpath code. I
> > > > think it should be an eventdev configuration parameter.
> > >
> > > See the above comment on num_fids.
> > >
> > > > > +                "  -s, --num_stages=N          Use N atomic stages (default 1)\n"
> > > > > +                "  -r, --rx-mask=core mask     Run NIC rx on CPUs in core mask\n"
> > > > > +                "  -w, --worker-mask=core mask Run worker on CPUs in core mask\n"
> > > > > +                "  -t, --tx-mask=core mask     Run NIC tx on CPUs in core mask\n"
> > > > > +                "  -e  --sched-mask=core mask  Run scheduler on CPUs in core mask\n"
> > > > > +                "  -c  --cq-depth=N            Worker CQ depth (default 16)\n"
> > > > > +                "  -W  --work-cycles=N         Worker cycles (default 0)\n"
> > > > > +                "  -P  --queue-priority        Enable scheduler queue prioritization\n"
> > > > > +                "  -o, --ordered               Use ordered scheduling\n"
> > > > > +                "  -p, --parallel              Use parallel scheduling\n"
> > > >
> > > > IMO, all stages being "parallel" or "ordered" or "atomic" is one mode
> > > > of operation; it is valid to have any combination. We need to express
> > > > that on the command line, for example a 3-stage pipeline with
> > > > O->A->P.
> > >
> > > How about we add an option that specifies the mode of operation for
> > > each stage in a string? Maybe have a '-m' option (modes) e.g. '-m appo'
> > > for 4 stages with atomic, parallel, parallel, ordered. Or maybe reuse
> > > your test-eventdev parameter style?
> >
> > Any scheme is fine.
> >
> > > > > +                "  -q, --quiet                 Minimize printed output\n"
> > > > > +                "  -D, --dump                  Print detailed statistics before exit"
> > > > > +                "\n";
> > > > > +        fprintf(stderr, "%s", usage_str);
> > > > > +        exit(1);
> > > > > +}
> > > > > +
> > > > [...]
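One possible shape for the '-m' per-stage modes string floated above
(hypothetical option, untested sketch):

    #include <rte_eventdev.h>

    /* e.g. "appo" => atomic, parallel, parallel, ordered;
     * returns the number of stages parsed, or -1 on error */
    static int
    parse_modes(const char *arg, uint8_t sched_types[], unsigned int max)
    {
            unsigned int i;

            for (i = 0; arg[i] != '\0'; i++) {
                    if (i >= max)
                            return -1;
                    switch (arg[i]) {
                    case 'a':
                            sched_types[i] = RTE_SCHED_TYPE_ATOMIC;
                            break;
                    case 'o':
                            sched_types[i] = RTE_SCHED_TYPE_ORDERED;
                            break;
                    case 'p':
                            sched_types[i] = RTE_SCHED_TYPE_PARALLEL;
                            break;
                    default:
                            return -1;
                    }
            }
            return i;
    }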
> > > > > +                        rx_single = (popcnt == 1);
> > > > > +                        break;
> > > > > +                case 't':
> > > > > +                        tx_lcore_mask = parse_coremask(optarg);
> > > > > +                        popcnt = __builtin_popcountll(tx_lcore_mask);
> > > > > +                        tx_single = (popcnt == 1);
> > > > > +                        break;
> > > > > +                case 'e':
> > > > > +                        sched_lcore_mask = parse_coremask(optarg);
> > > > > +                        popcnt = __builtin_popcountll(sched_lcore_mask);
> > > > > +                        sched_single = (popcnt == 1);
> > > > > +                        break;
> > > > > +                default:
> > > > > +                        usage();
> > > > > +                }
> > > > > +        }
> > > > > +
> > > > > +        if (worker_lcore_mask == 0 || rx_lcore_mask == 0 ||
> > > > > +                        sched_lcore_mask == 0 || tx_lcore_mask == 0) {
> > > > [...]
> > > > > +
> > > > > +        /* Q creation - one load balanced per pipeline stage */
> > > > > +
> > > > > +        /* set up one port per worker, linking to all stage queues */
> > > > > +        for (i = 0; i < num_workers; i++) {
> > > > > +                struct worker_data *w = &worker_data[i];
> > > > > +                w->dev_id = dev_id;
> > > > > +                if (rte_event_port_setup(dev_id, i, &wkr_p_conf) < 0) {
> > > > > +                        printf("Error setting up port %d\n", i);
> > > > > +                        return -1;
> > > > > +                }
> > > > > +
> > > > > +                uint32_t s;
> > > > > +                for (s = 0; s < num_stages; s++) {
> > > > > +                        if (rte_event_port_link(dev_id, i,
> > > > > +                                        &worker_queues[s].queue_id,
> > > > > +                                        &worker_queues[s].priority,
> > > > > +                                        1) != 1) {
> > > > > +                                printf("%d: error creating link for port %d\n",
> > > > > +                                                __LINE__, i);
> > > > > +                                return -1;
> > > > > +                        }
> > > > > +                }
> > > > > +                w->port_id = i;
> > > > > +        }
> > > > > +        /* port for consumer, linked to TX queue */
> > > > > +        if (rte_event_port_setup(dev_id, i, &tx_p_conf) < 0) {
> > > >
> > > > If the ethdev supports MT txq enqueue, then this port can be linked
> > > > to a worker too. Something to consider for the future.
> > >
> > > Sure. No change for now.
> >
> > OK.
>
> Just to add a comment on any remaining items above: we would hope that
> none of them are blockers for the merge of the current version, as they
> can be patched in the future as the infrastructure changes.
>
> Rgds,
> Dave.