From: Jerin Jacob
Subject: Re: [PATCH 1/3] examples/eventdev_pipeline: added sample app
Date: Thu, 29 Jun 2017 12:47:28 +0530
Message-ID: <20170629071726.GC5893@jerin>
References: <1492768299-84016-1-git-send-email-harry.van.haaren@intel.com>
 <1492768299-84016-2-git-send-email-harry.van.haaren@intel.com>
 <20170510141202.GA8431@jerin>
 <6c53f05d-2dd2-5b83-2eab-bcecd93bea82@intel.com>
 <20170627093506.GB14276@jerin>
 <72d1d150-a119-2b23-642c-484ca658c4b3@intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Cc: Harry van Haaren , dev@dpdk.org, Gage Eads , Bruce Richardson
To: "Hunt, David"
In-Reply-To: <72d1d150-a119-2b23-642c-484ca658c4b3@intel.com>
List-Id: DPDK patches and discussions

-----Original Message-----
> Date: Tue, 27 Jun 2017 14:12:20 +0100
> From: "Hunt, David"
> To: Jerin Jacob
> CC: Harry van Haaren , dev@dpdk.org, Gage Eads , Bruce Richardson
> Subject: Re: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added
>  sample app
> User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101
>  Thunderbird/45.8.0
>
> Hi Jerin:
>
>
> On 27/6/2017 10:35 AM, Jerin Jacob wrote:
> > -----Original Message-----
> > > Date: Mon, 26 Jun 2017 15:46:47 +0100
> > > From: "Hunt, David"
> > > To: Jerin Jacob , Harry van Haaren
> > > CC: dev@dpdk.org, Gage Eads , Bruce Richardson
> > > Subject: Re: [dpdk-dev] [PATCH 1/3] examples/eventdev_pipeline: added
> > >  sample app
> > > User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101
> > >  Thunderbird/45.8.0
> > >
> > > Hi Jerin,
> >
> > Hi David,
> >
> > Looks like you have sent the old version. The below-mentioned comments
> > are not addressed in v2.
>
> Oops. Glitch in the Matrix. I've just pushed a V3 with the changes.
>
> > > I'm assisting Harry on the sample app, and have just pushed up a V2
> > > patch based on your feedback. I've addressed most of your suggestions;
> > > comments below. There may still be a couple of outstanding questions
> > > that need further discussion.
> >
> > A few general comments:
> > 1) Nikhil/Gage's proposal for an ethdev Rx to eventdev adapter will
> > change the major portion (eventdev setup and producer()) of this
> > application.
> > 2) Producing only one lcore's worth of packets can't really serve as an
> > example eventdev application, as it will perform poorly on a low-end
> > machine. At the least, the application infrastructure should not be the
> > limiting factor.
> >
> > Considering the above points, should we wait for the Rx adapter to be
> > completed first? I would like to show this as a real-world application
> > that uses eventdev.
> >
> > Thoughts?
> >
> > On the same note:
> > Can we finalize the Rx adapter proposal? I can work on a v1 of the patch
> > and the common code if Nikhil or Gage don't have the bandwidth. Let me
> > know.
> >
> > Last follow-up:
> > http://dpdk.org/ml/archives/dev/2017-June/068776.html
>
> I had a quick chat with Harry, and wonder if we'd be as well to merge the
> app as it is now; as the new frameworks become available, the app can be
> updated to make use of them. I feel it would be better to have something
> out there for people to play with than waiting for 17.11.

I agree with your concern.
How about renaming the test and docs to be specific to the SW PMD, and then,
once we fix the known issues with HW eventdev + ethdev (Rx adapter)
integration, renaming the application back to a generic eventdev one?

> Also, if you have the bandwidth to patch the app for your desired use
> cases, that would be a good contribution. I'd only be guessing for some
> of it :)
>
> > > Regards,
> > > Dave
> > >
> > > On 10/5/2017 3:12 PM, Jerin Jacob wrote:
> > > > -----Original Message-----
> > > > > Date: Fri, 21 Apr 2017 10:51:37 +0100
> > > > > From: Harry van Haaren
> > > > > To: dev@dpdk.org
> > > > > CC: jerin.jacob@caviumnetworks.com, Harry van Haaren , Gage Eads ,
> > > > >  Bruce Richardson
> > > > > Subject: [PATCH 1/3] examples/eventdev_pipeline: added sample app
> > > > > X-Mailer: git-send-email 2.7.4
> > > > >
> > > > > This commit adds a sample app for the eventdev library.
> > > > > The app has been tested with DPDK 17.05-rc2, hence this
> > > > > release (or later) is recommended.
> > > > >
> > > > > The sample app showcases a pipeline processing use-case,
> > > > > with event scheduling and processing defined per stage.
> > > > > The application receives traffic as normal, with each
> > > > > packet traversing the pipeline. Once the packet has
> > > > > been processed by each of the pipeline stages, it is
> > > > > transmitted again.
> > > > >
> > > > > The app provides a framework to utilize cores for a single
> > > > > role or multiple roles. Examples of roles are the RX core,
> > > > > TX core, Scheduling core (in the case of the event/sw PMD),
> > > > > and worker cores.
> > > > >
> > > > > Various flags are available to configure numbers of stages,
> > > > > cycles of work at each stage, type of scheduling, number of
> > > > > worker cores, queue depths etc. For a full explanation,
> > > > > please refer to the documentation.
> > > > >
> > > > > Signed-off-by: Gage Eads
> > > > > Signed-off-by: Bruce Richardson
> > > > > Signed-off-by: Harry van Haaren
> > > >
> > > > Thanks for the example application sharing the SW view.
> > > > I could make it run on HW after some tweaking (not optimized though).
> > > >
> > > > [...]
> > > > > +#define MAX_NUM_STAGES 8
> > > > > +#define BATCH_SIZE 16
> > > > > +#define MAX_NUM_CORE 64
> > > >
> > > > How about RTE_MAX_LCORE?
> > >
> > > Core usage in the sample app is held in a uint64_t. Adding arrays would
> > > be possible, but I feel that the extra effort would not give that much
> > > benefit. I've left it as is for the moment, unless you see any strong
> > > requirement to go beyond 64 cores?
> >
> > I think it is OK. Again, with the service core infrastructure this will
> > change.
> >
> > > > > +
> > > > > +static unsigned int active_cores;
> > > > > +static unsigned int num_workers;
> > > > > +static unsigned long num_packets = (1L << 25); /* do ~32M packets */
> > > > > +static unsigned int num_fids = 512;
> > > > > +static unsigned int num_priorities = 1;
> > > >
> > > > Looks like it's not used.
> > >
> > > Yes. Removed.
> > >
> > > > > +static unsigned int num_stages = 1;
> > > > > +static unsigned int worker_cq_depth = 16;
> > > > > +static int queue_type = RTE_EVENT_QUEUE_CFG_ATOMIC_ONLY;
> > > > > +static int16_t next_qid[MAX_NUM_STAGES+1] = {-1};
> > > > > +static int16_t qid[MAX_NUM_STAGES] = {-1};
> > > >
> > > > Moving all fastpath-related variables under a cache-aligned structure
> > > > will help.
> > >
> > > I tried a few different combinations of this, and saw no gains, some
> > > losses. So I will leave it as is for the moment, if that's OK.
> >
> > I think the variables used in the fastpath are better allocated from
> > huge pages using rte_malloc().
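For illustration, a minimal sketch of that suggestion: group the fastpath
globals into one cache-aligned structure and allocate it from hugepage-backed
memory with rte_zmalloc(). The structure name and field selection below are
hypothetical, not taken from the patch:

    #include <rte_common.h>
    #include <rte_malloc.h>

    /* hypothetical grouping of the app's fastpath globals */
    struct fastpath_data {
            volatile int done;
            unsigned int num_fids;
            unsigned int num_stages;
            int worker_cycles;
            int16_t next_qid[MAX_NUM_STAGES + 1];
            int16_t qid[MAX_NUM_STAGES];
    } __rte_cache_aligned;

    static struct fastpath_data *fdata;

    static int
    init_fastpath_data(void)
    {
            /* rte_zmalloc() hands back zeroed, hugepage-backed memory */
            fdata = rte_zmalloc(NULL, sizeof(*fdata), RTE_CACHE_LINE_SIZE);
            return (fdata == NULL) ? -1 : 0;
    }

Workers would then read fdata->next_qid[] and friends, keeping the hot data
in a few known cache lines.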
> > > > > +static int worker_cycles;
> > > > > +static int enable_queue_priorities;
> > > > > +
> > > > > +struct prod_data {
> > > > > +        uint8_t dev_id;
> > > > > +        uint8_t port_id;
> > > > > +        int32_t qid;
> > > > > +        unsigned num_nic_ports;
> > > > > +};
> > >
> > > Yes, I saw a percent or two gain when this plus the following two data
> > > structs were cache aligned.
> >
> > Looks like it's not fixed in v2. Looks like you have sent the old
> > version.
> >
> > > > > +
> > > > > +        return 0;
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +producer(void)
> > > > > +{
> > > > > +        static uint8_t eth_port;
> > > > > +        struct rte_mbuf *mbufs[BATCH_SIZE];
> > > > > +        struct rte_event ev[BATCH_SIZE];
> > > > > +        uint32_t i, num_ports = prod_data.num_nic_ports;
> > > > > +        int32_t qid = prod_data.qid;
> > > > > +        uint8_t dev_id = prod_data.dev_id;
> > > > > +        uint8_t port_id = prod_data.port_id;
> > > > > +        uint32_t prio_idx = 0;
> > > > > +
> > > > > +        const uint16_t nb_rx = rte_eth_rx_burst(eth_port, 0, mbufs,
> > > > > +                        BATCH_SIZE);
> > > > > +        if (++eth_port == num_ports)
> > > > > +                eth_port = 0;
> > > > > +        if (nb_rx == 0) {
> > > > > +                rte_pause();
> > > > > +                return 0;
> > > > > +        }
> > > > > +
> > > > > +        for (i = 0; i < nb_rx; i++) {
> > > > > +                ev[i].flow_id = mbufs[i]->hash.rss;
> > > >
> > > > Prefetching mbufs[i+1] may help here?
> > >
> > > I tried; it didn't make much difference.
> >
> > OK.
> >
> > > > > +                ev[i].op = RTE_EVENT_OP_NEW;
> > > > > +                ev[i].sched_type = queue_type;
> > > >
> > > > The value of RTE_EVENT_QUEUE_CFG_ORDERED_ONLY != RTE_SCHED_TYPE_ORDERED,
> > > > so we cannot assign .sched_type as queue_type.
> > > >
> > > > I think one option to avoid the translation in the application is to:
> > > > - Remove RTE_EVENT_QUEUE_CFG_ALL_TYPES and RTE_EVENT_QUEUE_CFG_*_ONLY
> > > > - Introduce a new RTE_EVENT_DEV_CAP_ flag to denote the
> > > >   RTE_EVENT_QUEUE_CFG_ALL_TYPES capability
> > > > - Add sched_type to struct rte_event_queue_conf. If the capability
> > > >   flag is not set, then the implementation takes the sched_type value
> > > >   for the queue.
> > > >
> > > > Any thoughts?
> > >
> > > Not sure here; would it be OK for the moment, and we can work on a
> > > patch in the future?
> >
> > OK.
> >
> > > > > +
> > > > > +        if (tx_core[lcore_id] && (tx_single ||
> > > > > +            rte_atomic32_cmpset(&tx_lock, 0, 1))) {
> > > > > +                consumer();
> > > >
> > > > Does consumer() need to come in this pattern? I am thinking that if
> > > > the event is from the last stage, then worker() calls consumer().
> > > >
> > > > I think the above scheme works better when the _same_ worker code
> > > > needs to run in both of the cases where
> > > > 1) the ethdev HW is capable of enqueuing packets to the same txq from
> > > >    multiple threads, and
> > > > 2) the ethdev is not capable of doing so.
> > > >
> > > > Both cases can then be addressed at configuration time, where we link
> > > > the queues to ports:
> > > > case 1) link all workers to the last queue
> > > > case 2) link only one worker to the last queue
> > > > and keep the worker code common.
> > > >
> > > > The HW implementation has functional and performance issues if "two"
> > > > ports are assigned to one lcore for dequeue. The above scheme fixes
> > > > that problem too.
> > >
> > > Can we have a bit more discussion on this item? Is this needed for this
> > > sample app, or can we perhaps work on a patch for this later? Harry?
> >
> > As explained above, is there any issue in keeping consumer() for the
> > last stage?
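To illustrate the scheme, the same worker code would run in both cases; only
the port links differ at setup time. A rough sketch, where tx_queue_id,
tx_port_id and mt_safe_txq are hypothetical names:

    #include <rte_eventdev.h>

    static int
    setup_tx_links(uint8_t dev_id, int mt_safe_txq)
    {
            uint8_t last_q = tx_queue_id;   /* queue feeding the TX stage */
            uint8_t prio = RTE_EVENT_DEV_PRIORITY_NORMAL;
            unsigned int i;

            if (mt_safe_txq) {
                    /* case 1: any worker may transmit - link all of them */
                    for (i = 0; i < num_workers; i++)
                            if (rte_event_port_link(dev_id,
                                            worker_data[i].port_id,
                                            &last_q, &prio, 1) != 1)
                                    return -1;
                    return 0;
            }
            /* case 2: txq is not MT safe - link only the dedicated TX port */
            return rte_event_port_link(dev_id, tx_port_id,
                            &last_q, &prio, 1) == 1 ? 0 : -1;
    }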
>
> I would probably see this as a future enhancement as per my initial
> comments above. Any hardware or new framework additions are welcome as
> future patches to the app.
>
> > > > > +                rte_atomic32_clear((rte_atomic32_t *)&tx_lock);
> > > > > +        }
> > > > > +}
> > > > > +
> > > > > +static int
> > > > > +worker(void *arg)
> > > > > +{
> > > > > +        struct rte_event events[BATCH_SIZE];
> > > > > +
> > > > > +        struct worker_data *data = (struct worker_data *)arg;
> > > > > +        uint8_t dev_id = data->dev_id;
> > > > > +        uint8_t port_id = data->port_id;
> > > > > +        size_t sent = 0, received = 0;
> > > > > +        unsigned lcore_id = rte_lcore_id();
> > > > > +
> > > > > +        while (!done) {
> > > > > +                uint16_t i;
> > > > > +
> > > > > +                schedule_devices(dev_id, lcore_id);
> > > > > +
> > > > > +                if (!worker_core[lcore_id]) {
> > > > > +                        rte_pause();
> > > > > +                        continue;
> > > > > +                }
> > > > > +
> > > > > +                uint16_t nb_rx = rte_event_dequeue_burst(dev_id,
> > > > > +                                port_id, events, RTE_DIM(events), 0);
> > > > > +
> > > > > +                if (nb_rx == 0) {
> > > > > +                        rte_pause();
> > > > > +                        continue;
> > > > > +                }
> > > > > +                received += nb_rx;
> > > > > +
> > > > > +                for (i = 0; i < nb_rx; i++) {
> > > > > +                        struct ether_hdr *eth;
> > > > > +                        struct ether_addr addr;
> > > > > +                        struct rte_mbuf *m = events[i].mbuf;
> > > > > +
> > > > > +                        /* The first worker stage does classification */
> > > > > +                        if (events[i].queue_id == qid[0])
> > > > > +                                events[i].flow_id = m->hash.rss % num_fids;
> > > >
> > > > Not sure why we need to do this (shrinking the flows) in worker() in
> > > > a queue-based pipeline. If a PMD has any specific requirement on
> > > > num_fids, I think we can move this to the configuration stage, or the
> > > > PMD can choose an optimum fid internally to avoid the modulus
> > > > operation tax in the fastpath in all PMDs.
> > > >
> > > > Does struct rte_event_queue_conf.nb_atomic_flows help here?
> > >
> > > In my tests the modulus makes very little difference in the throughput.
> > > And I think it's good to have a way of varying the number of flows for
> > > testing different scenarios, even if it's not the most performant.
> >
> > Not sure.
> >
> > > > > +
> > > > > +                        events[i].queue_id =
> > > > > +                                next_qid[events[i].queue_id];
> > > > > +                        events[i].op = RTE_EVENT_OP_FORWARD;
> > > >
> > > > Missing events[i].sched_type; the HW PMD does not work without it.
> > > > I think we can use a scheme similar to next_qid for a next_sched_type.
> > >
> > > Done. Added events[i].sched_type = queue_type.
> > >
> > > > > +
> > > > > +                        /* change mac addresses on packet (to use mbuf data) */
> > > > > +                        eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
> > > > > +                        ether_addr_copy(&eth->d_addr, &addr);
> > > > > +                        ether_addr_copy(&eth->s_addr, &eth->d_addr);
> > > > > +                        ether_addr_copy(&addr, &eth->s_addr);
> > > >
> > > > IMO, we can make the packet processing code a "static inline
> > > > function" so different worker types can reuse it.
> > >
> > > Done. Moved out to a work() function.
> >
> > I think the MAC swap should be done only in the last stage, not on each
> > forward; i.e. with the existing code, a two-stage forward leaves the
> > addresses back in their original order.
> >
> > > > > +
> > > > > +                        /* do a number of cycles of work per packet */
> > > > > +                        volatile uint64_t start_tsc = rte_rdtsc();
> > > > > +                        while (rte_rdtsc() < start_tsc + worker_cycles)
> > > > > +                                rte_pause();
> > > >
> > > > Ditto.
> > >
> > > Done. Moved out to a work() function.
> >
> > I think all worker-specific variables like "worker_cycles" can be moved
> > into one structure and used from there.
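For reference, a sketch of the kind of shared work() helper being discussed
(illustrative only; the v3 code may differ). Per the comment above, the MAC
swap could additionally be guarded so it only runs in the last stage:

    #include <rte_cycles.h>
    #include <rte_ether.h>
    #include <rte_mbuf.h>

    static inline void
    work(struct rte_mbuf *m)
    {
            struct ether_hdr *eth;
            struct ether_addr addr;

            /* swap src/dst MAC addresses to touch the mbuf data */
            eth = rte_pktmbuf_mtod(m, struct ether_hdr *);
            ether_addr_copy(&eth->d_addr, &addr);
            ether_addr_copy(&eth->s_addr, &eth->d_addr);
            ether_addr_copy(&addr, &eth->s_addr);

            /* burn a configurable number of cycles per packet;
             * worker_cycles is the app's existing global */
            if (worker_cycles > 0) {
                    uint64_t start_tsc = rte_rdtsc();
                    while (rte_rdtsc() < start_tsc + worker_cycles)
                            rte_pause();
            }
    }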
> > > > > +                }
> > > > > +                uint16_t nb_tx = rte_event_enqueue_burst(dev_id,
> > > > > +                                port_id, events, nb_rx);
> > > > > +                while (nb_tx < nb_rx && !done)
> > > > > +                        nb_tx += rte_event_enqueue_burst(dev_id,
> > > > > +                                        port_id, events + nb_tx,
> > > > > +                                        nb_rx - nb_tx);
> > > > > +                sent += nb_tx;
> > > > > +        }
> > > > > +
> > > > > +        if (!quiet)
> > > > > +                printf("  worker %u thread done. RX=%zu TX=%zu\n",
> > > > > +                                rte_lcore_id(), received, sent);
> > > > > +
> > > > > +        return 0;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Parse the coremask given as argument (hexadecimal string) and fill
> > > > > + * the global configuration (core role and core count) with the parsed
> > > > > + * value.
> > > > > + */
> > > > > +static int xdigit2val(unsigned char c)
> > > >
> > > > There are multiple instances of "xdigit2val" in the DPDK repo. Maybe
> > > > we can push this as common code.
> > >
> > > Sure, that's something we can look at in a separate patch, now that
> > > it's being used more and more.
> >
> > Makes sense.
> >
> > > > > +{
> > > > > +        int val;
> > > > > +
> > > > > +        if (isdigit(c))
> > > > > +                val = c - '0';
> > > > > +        else if (isupper(c))
> > > > > +                val = c - 'A' + 10;
> > > > > +        else
> > > > > +                val = c - 'a' + 10;
> > > > > +        return val;
> > > > > +}
> > > > > +
> > > > > +
> > > > > +static void
> > > > > +usage(void)
> > > > > +{
> > > > > +        const char *usage_str =
> > > > > +                "  Usage: eventdev_demo [options]\n"
> > > > > +                "  Options:\n"
> > > > > +                "  -n, --packets=N             Send N packets (default ~32M), 0 implies no limit\n"
> > > > > +                "  -f, --atomic-flows=N        Use N random flows from 1 to N (default 16)\n"
> > > >
> > > > I think this parameter now affects the application fastpath code. I
> > > > think it should be an eventdev configuration parameter.
> > >
> > > See the above comment on num_fids.
> > >
> > > > > +                "  -s, --num_stages=N          Use N atomic stages (default 1)\n"
> > > > > +                "  -r, --rx-mask=core mask     Run NIC rx on CPUs in core mask\n"
> > > > > +                "  -w, --worker-mask=core mask Run worker on CPUs in core mask\n"
> > > > > +                "  -t, --tx-mask=core mask     Run NIC tx on CPUs in core mask\n"
> > > > > +                "  -e  --sched-mask=core mask  Run scheduler on CPUs in core mask\n"
> > > > > +                "  -c  --cq-depth=N            Worker CQ depth (default 16)\n"
> > > > > +                "  -W  --work-cycles=N         Worker cycles (default 0)\n"
> > > > > +                "  -P  --queue-priority        Enable scheduler queue prioritization\n"
> > > > > +                "  -o, --ordered               Use ordered scheduling\n"
> > > > > +                "  -p, --parallel              Use parallel scheduling\n"
> > > >
> > > > IMO, all stages being "parallel" or "ordered" or "atomic" is one mode
> > > > of operation; it is valid to have any combination. We need to express
> > > > that on the command line, for example a 3-stage pipeline with
> > > > O->A->P.
> > >
> > > How about we add an option that specifies the mode of operation for
> > > each stage in a string? Maybe have a '-m' option (modes) e.g. '-m appo'
> > > for 4 stages with atomic, parallel, parallel, ordered. Or maybe reuse
> > > your test-eventdev parameter style?
> >
> > Any scheme is fine.
> >
> > > > > +                "  -q, --quiet                 Minimize printed output\n"
> > > > > +                "  -D, --dump                  Print detailed statistics before exit"
> > > > > +                "\n";
> > > > > +        fprintf(stderr, "%s", usage_str);
> > > > > +        exit(1);
> > > > > +}
> > > > > +
> > > > [...]
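One possible shape for the '-m' per-stage modes string floated above
(hypothetical option, untested sketch):

    #include <rte_eventdev.h>

    /* e.g. "appo" => atomic, parallel, parallel, ordered;
     * returns the number of stages parsed, or -1 on error */
    static int
    parse_modes(const char *arg, uint8_t sched_types[], unsigned int max)
    {
            unsigned int i;

            for (i = 0; arg[i] != '\0'; i++) {
                    if (i >= max)
                            return -1;
                    switch (arg[i]) {
                    case 'a':
                            sched_types[i] = RTE_SCHED_TYPE_ATOMIC;
                            break;
                    case 'o':
                            sched_types[i] = RTE_SCHED_TYPE_ORDERED;
                            break;
                    case 'p':
                            sched_types[i] = RTE_SCHED_TYPE_PARALLEL;
                            break;
                    default:
                            return -1;
                    }
            }
            return i;
    }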
> > > > > +                        rx_single = (popcnt == 1);
> > > > > +                        break;
> > > > > +                case 't':
> > > > > +                        tx_lcore_mask = parse_coremask(optarg);
> > > > > +                        popcnt = __builtin_popcountll(tx_lcore_mask);
> > > > > +                        tx_single = (popcnt == 1);
> > > > > +                        break;
> > > > > +                case 'e':
> > > > > +                        sched_lcore_mask = parse_coremask(optarg);
> > > > > +                        popcnt = __builtin_popcountll(sched_lcore_mask);
> > > > > +                        sched_single = (popcnt == 1);
> > > > > +                        break;
> > > > > +                default:
> > > > > +                        usage();
> > > > > +                }
> > > > > +        }
> > > > > +
> > > > > +        if (worker_lcore_mask == 0 || rx_lcore_mask == 0 ||
> > > > > +                        sched_lcore_mask == 0 || tx_lcore_mask == 0) {
> > > > [...]
> > > > > +
> > > > > +        /* Q creation - one load balanced per pipeline stage */
> > > > > +
> > > > > +        /* set up one port per worker, linking to all stage queues */
> > > > > +        for (i = 0; i < num_workers; i++) {
> > > > > +                struct worker_data *w = &worker_data[i];
> > > > > +                w->dev_id = dev_id;
> > > > > +                if (rte_event_port_setup(dev_id, i, &wkr_p_conf) < 0) {
> > > > > +                        printf("Error setting up port %d\n", i);
> > > > > +                        return -1;
> > > > > +                }
> > > > > +
> > > > > +                uint32_t s;
> > > > > +                for (s = 0; s < num_stages; s++) {
> > > > > +                        if (rte_event_port_link(dev_id, i,
> > > > > +                                        &worker_queues[s].queue_id,
> > > > > +                                        &worker_queues[s].priority,
> > > > > +                                        1) != 1) {
> > > > > +                                printf("%d: error creating link for port %d\n",
> > > > > +                                                __LINE__, i);
> > > > > +                                return -1;
> > > > > +                        }
> > > > > +                }
> > > > > +                w->port_id = i;
> > > > > +        }
> > > > > +        /* port for consumer, linked to TX queue */
> > > > > +        if (rte_event_port_setup(dev_id, i, &tx_p_conf) < 0) {
> > > >
> > > > If the ethdev supports MT txq enqueue, then this port can be linked
> > > > to a worker too. Something to consider for the future.
> > >
> > > Sure. No change for now.
> >
> > OK.
>
> Just to add a comment on any remaining items above: we would hope that
> none of them are blockers for the merge of the current version, as they
> can be patched in the future as the infrastructure changes.
>
> Rgds,
> Dave.