* Prioritizing URIs with tight performance requirement in openBmc with bmcweb
@ 2023-05-24  9:35 Rohit Pai
  2023-05-24 16:26 ` Ed Tanous
  0 siblings, 1 reply; 9+ messages in thread
From: Rohit Pai @ 2023-05-24  9:35 UTC (permalink / raw)
  To: openbmc


Hello All,

We have a requirement in our platform to serve a few specific URIs with a tight performance requirement on turnaround time (latency).
One such example is the telemetry sensor metric URI, which carries power and thermal data and can have a maximum turnaround time of 500ms.

The current bmcweb design uses only a single thread to serve all URI requests/responses.
If bmcweb is processing a huge amount of data (which is common for aggregation URIs), other requests get blocked and their latency suffers.
Here I am referring to the time bmcweb takes to prepare the JSON response after it has received the data from the backend service.
In our platform, we see that the power/thermal metric URI can take more than 500ms when it is requested in parallel with another aggregation URI that has a huge response payload.

To solve this problem, we thought of a couple of solutions.


  1.  To introduce multi-threading support in bmcweb.
Does anyone have any experience/feedback on making this work?
Is there any strong reason not to have multi-threading support in bmcweb, other than the general guideline to avoid threads?


  2.  To use a reverse proxy like nginx as the front end to redirect a few URIs to a new application server.
Here the idea is to develop a new application server to serve the URIs that have strong latency requirements and route the rest of the URIs to bmcweb.
       Has anyone experienced any limitations with nginx on OpenBMC platforms (w.r.t. performance, memory footprint, etc.)?
       We also have a requirement to support SSE. Is there any known limitation to making such a feature work with nginx?


Are there any other suggestions or solutions for meeting our performance requirement with bmcweb?


Thanks
Rohit PAI




* Re: Prioritizing URIs with tight performance requirement in openBmc with bmcweb
  2023-05-24  9:35 Prioritizing URIs with tight performance requirement in openBmc with bmcweb Rohit Pai
@ 2023-05-24 16:26 ` Ed Tanous
  2023-05-24 17:56   ` Ed Tanous
  2023-05-26 15:16   ` Rohit Pai
  0 siblings, 2 replies; 9+ messages in thread
From: Ed Tanous @ 2023-05-24 16:26 UTC (permalink / raw)
  To: Rohit Pai; +Cc: openbmc

On Wed, May 24, 2023 at 2:36 AM Rohit Pai <ropai@nvidia.com> wrote:
>
> Hello All,
>
>
>
> We have a requirement in our platform to serve a few specific URIs with a tight performance requirement on turnaround time (latency).
>
> One such example is the telemetry sensor metric URI, which carries power and thermal data and can have a maximum turnaround time of 500ms.

What other constraints are here?  We're talking about a TCP protocol,
running on a network, on a CPU shared with many other processes.  Are
these hard realtime requirements?  It's unlikely you're going to get
hard realtime guarantees from Redfish.

>
>
>
> The current bmcweb design uses only a single thread to serve all URI requests/responses.
>
> If bmcweb is processing a huge amount of data (which is common for aggregation URIs) then other requests would get blocked and their latency time would get impacted.

The bmcweb queuing flow looks something like:

    A      B                  C       D
TCP ─► TLS ─► HTTP Connection ─► DBus ─► Daemon

Which location are you seeing queuing problems?  Keep in mind, HTTP
1.1 can only process a single request/response at a time per
connection, so if your system is trying to process things from a
single connection at A, you're right, long requests will block short
ones.

>
> Here I am referring to the time bmcweb takes to prepare the JSON response after it has got the data from the backend service.

What is the ballpark for how big "huge amount" of data would be?  What
processing is actually being done?   This would be the first time that
json parsing itself has actually shown up on a performance profile,
but with expand + aggregation, you're right, there's potential for
that.

One thing I've considered before is switching bmcweb over to
boost::json, which can do incremental chunked parsing, unlike
nlohmann, which would let us unblock the flows as each processes the
data.
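Roughly, the shape would be something like this (untested sketch; stream_parser is boost::json's real incremental API, but the function name and 4kB slice size here are just illustrative):

#include <boost/json.hpp>

#include <algorithm>
#include <string_view>

// Sketch: feed the parser a slice at a time instead of the whole payload.
// Between slices, bmcweb could post back to the io_context so other
// connections' handlers get a turn.
inline boost::json::value parseIncrementally(std::string_view payload)
{
    boost::json::stream_parser p;
    std::size_t offset = 0;
    while (offset < payload.size())
    {
        std::size_t n = std::min<std::size_t>(4096, payload.size() - offset);
        p.write(payload.data() + offset, n); // accepts incomplete JSON
        offset += n;
    }
    p.finish();         // errors if the document is incomplete
    return p.release(); // the fully parsed document
}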

>
> In our platform, we see that the power/thermal metric URI can take more than 500ms when it is requested in parallel with another aggregation URI that has a huge response payload.

Can you share your test?  Is your test using multiple connections to
ensure that the thermal metric is being pulled from a different
connection than the aggregation URI?

>
>
>
> To solve this problem, we thought of a couple of solutions.
>
>
>
> To introduce multi-threading support in bmcweb.

Sure, I have no problem with adding threads, and it really wouldn't be
tough to accomplish as a test:
1. Link pthreads in meson.  Make this a meson option so platforms that
don't need multiple threads can opt out of it.
2. Go to each async_read and async_write call, and ensure that they
are using a strand (to keep processing on the same thread for any one
call).
3. Locate all of the global and cross connection data structures, and
add a mutex to each of them.  One of the global data structures is the
Dbus connection itself, so if your performance problem exists on C or
D above, it will likely still exist with multiple threads.
4. Update the sdbusplus asio connection to support strands, ensuring
that the callbacks happen on the same thread they were requested on.
Alternatively, just set up a dbus connection per thread.
5. Test heavily to make sure we don't have threading access problems
or missing mutexes.
6. Update the DEVELOPING.md doc to account for multiple threads in the
way we review code. (reentrancy, etc).  Most of the existing code
should be reentrant, but it's worth looking.
There are likely a few other minor things that would need to be fixed,
but the above is the general gist.
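
To make steps 1 and 2 concrete, here's roughly the shape (a standalone sketch, not actual bmcweb code; the thread count of 4 and all the names are arbitrary):

#include <boost/asio/executor_work_guard.hpp>
#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>
#include <boost/asio/strand.hpp>

#include <thread>
#include <vector>

int main()
{
    boost::asio::io_context io;
    // Keep run() from returning while the queue is empty; a real server
    // would reset this guard on shutdown.
    auto guard = boost::asio::make_work_guard(io);

    // One strand per connection: handlers posted through it never run
    // concurrently, even with multiple worker threads.
    auto connStrand = boost::asio::make_strand(io);
    boost::asio::post(connStrand, [] { /* this connection's handler */ });

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < 4; ++i)
    {
        workers.emplace_back([&io] { io.run(); });
    }
    for (std::thread& t : workers)
    {
        t.join();
    }
}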

>
> Does anyone have any experience/feedback on making this work?
>
> Is there any strong reason not to have multi-threading support in bmcweb other than general guidelines to avoid threads?

It increases the binary size beyond what can fit on a lot of BMCs
(about 10-20%).  This is fine so long as you keep it as a compile option
so people can opt into threading support.  Historically, teaching and
reviewing multi-threaded code has been an order of magnitude more
difficult than single-threaded code, so keeping the single thread
significantly simplifies the review process; please plan on having
folks prepared to review code for multi-threaded correctness.

>
>
>
> To use a reverse proxy like nginx as the front end to redirect a few URIs to a new application server.

Please take a look at the OpenBMC tree around 2018-2019.  There were
several platforms that formerly used nginx as the front end to bmcweb,
and have since dropped it.  There was also a discussion on discord
recently you might look at.  I'm not really sure how nginx would solve
your problem though.  The bmcweb reactor design looks similar to nginx
(we use asio, they use libuv) already, so it's not clear to me what
you would gain here, unless you were running multiple processes of
bmcweb?  Keep in mind, there'd need to be some sort of shared state in
that case, so you have to do #3 in the above anyway.

>
> Here the idea is to develop a new application server to serve the URIs which have strong latency requirements and route the rest of the URIs to bmcweb.

This is the part I don't understand: if the forwarding calls in this
new server are blocking to bmcweb, what's the point of adding it?
Feel free to just show the code of this working as well.

>
>        Has anyone experienced any limitations with nginx on openBmc platforms (w.r.t performance, memory footprint, etc)?
>
>        We also have the requirement to support SSE, Is there any known limitation to make such a feature work with nginx?

It can be made to work.  AuthX tends to be the harder part, as
implementing CSRF for SSE or websockets is a huge pain.

>
>
>
>
>
> Any other suggestion or solution to the problem we are solving to meet our performance requirement with bmcweb?

1. Audit your code for any blocking calls.  If you have any, put them
into a loop, process X bytes at a time, while calling
boost::asio::dispatch in between, to not starve the other tasks.
2. Move the bmcweb core to a json library that can do incremental
serialization/deserialization.  boost::json would be my first choice.
3. I have patches to turn on uring, which lets us use
boost::asio::random_access_file to fix #1 for blocking filesystem
calls.
4. Set reasonable limits on the max aggregation size that is allowed
at a system level.  There were some proposals on gerrit.
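
For #1, the loop would look something like the following (sketch only; the names and 4kB slice size are made up, and note that post(), unlike dispatch(), guarantees the handler is queued rather than run inline, which is the yield we want here):

#include <boost/asio/io_context.hpp>
#include <boost/asio/post.hpp>

#include <algorithm>
#include <memory>
#include <vector>

// Sketch: handle `data` 4kB at a time, re-queueing ourselves between
// slices so other pending handlers (e.g. the thermal metric URI) run.
inline void processInSlices(boost::asio::io_context& io,
                            std::shared_ptr<std::vector<char>> data,
                            std::size_t offset)
{
    std::size_t end = std::min(offset + 4096, data->size());
    // ... do the real per-byte work on [offset, end) here ...
    if (end < data->size())
    {
        boost::asio::post(io, [&io, data, end]() {
            processInSlices(io, data, end);
        });
    }
}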


I would be worried about separating code into two classes (high
priority/low priority) because everyone's opinions will differ about
what should be "high" priority, and what should be "low" priority.  If
that isn't configurable easily, I worry that we're going to have
problems agreeing on priority, and I don't want to be in a situation
where every developer is having to make priority calls for every
system class.  I'm open to the possibility here, but we need to make
sure it keeps code moving.

>
>
>
>
>
> Thanks
>
> Rohit PAI
>
>


* Re: Prioritizing URIs with tight performance requirement in openBmc with bmcweb
  2023-05-24 16:26 ` Ed Tanous
@ 2023-05-24 17:56   ` Ed Tanous
  2023-06-01 18:33     ` Ed Tanous
  2023-05-26 15:16   ` Rohit Pai
  1 sibling, 1 reply; 9+ messages in thread
From: Ed Tanous @ 2023-05-24 17:56 UTC (permalink / raw)
  To: Rohit Pai; +Cc: openbmc

On Wed, May 24, 2023 at 9:26 AM Ed Tanous <edtanous@google.com> wrote:
>
> There's likely a few other minor things that would need fixed, but the
> above is the general gist.

I spent an hour or so and put together a simple POC of multithreading
bmcweb.  It's not threadsafe for all the global structures (although I
did one as an example), and it has a lot of safety issues we'll need to
work through, but it will give us a hint about whether multi-threading
bmcweb will solve your performance problem:
https://gerrit.openbmc.org/c/openbmc/bmcweb/+/63710

PTAL


* RE: Prioritizing URIs with tight performance requirement in openBmc with bmcweb
  2023-05-24 16:26 ` Ed Tanous
  2023-05-24 17:56   ` Ed Tanous
@ 2023-05-26 15:16   ` Rohit Pai
  2023-05-26 16:29     ` Ed Tanous
  1 sibling, 1 reply; 9+ messages in thread
From: Rohit Pai @ 2023-05-26 15:16 UTC (permalink / raw)
  To: Ed Tanous; +Cc: openbmc

Hello Ed, 

Thanks for your response and feedback. Please find my answers to your questions below. 

> > What other constraints are here?  We're talking about a TCP protocol, running on a network, on a multi-process CPU.  Are these hard realtime requirements?  It's unlikely you're going to get hard realtime guarantees from Redfish.
No, these are not hard real-time requirements. The latency can go beyond 500ms in some instances, but we want the outliers (samples that take more than 500ms of turnaround time) to be less than one percent of the total samples. 
We also want to limit the max latency of the power/thermal metric URI to under 1s. The fabric BMC periodically polls this URI from our platform and implements its PID control loop based on the response, so it is critical for us to limit the max latency of this URI. 

>>     A      B                  C       D
>> TCP ─► TLS ─► HTTP Connection ─► DBus ─► Daemon
>> Which location are you seeing queuing problems?  
The time spent in the route handler after getting the response from the backend daemon is affecting the next URI waiting in the queue. 
From our tests we see that the backend daemon responds on average within 200ms. Since this is async, it has no impact. 
The response received from the dbus backend has around 2500 properties to process. Processing this in the route handler and preparing the response takes an average of 200ms (max 400ms, with several instances close to 300ms). The average CPU consumption during the test is 50%, and bmcweb's average consumption is between 12 and 15%. 

 
>> What is the ballpark for how big "huge amount" of data would be?  What processing is being done?   
The JSON in the HTTP response is around 400KB. 
The route handler logic makes the GetManagedObjects call to the backend service and processes the response, which has around 2500 properties to loop through. 
There is no specific business logic here that is time consuming; it's the sheer number of properties that seems to have the significant impact. 

>> One thing I've considered before is switching bmcweb over to boost::json, which can do incremental chunked parsing, unlike nlohmann, which would let us unblock the flows as each processes the data.
Do you have any pointers for this? Where exactly would we unblock the flow: in the route handler after processing a chunk of properties from the backend, or somewhere else? 

>> Can you share your test?  Is your test using multiple connections to ensure that the thermal metric is being pulled from a different connection than the aggregation URI?
In the test we have two clients connected to our platform. 
One client polls the power/thermal metric URI every 500ms. We have a single aggregate URI for this, which combines all power and thermal sensors; we have close to 100 sensors. This is the URI with the strict performance requirement. 
The other client polls the stats-and-counters aggregate URI every 5 seconds. This is the URI that has around 400KB of JSON response and 2500 dbus properties to process. We are not very particular about the latency of the
stats-and-counters URI. In the test results we see that the power/thermal metric URI has a max latency of 1.2 seconds (min 2ms, avg 9ms) and around 3.5% outliers (samples that took more than 500ms). Whenever bmcweb is busy preparing the response for the stats-and-counters URI and a request for the thermal metric URI arrives, the latency of the thermal metric URI is affected. 
If we limit the number of properties in the stats-and-counters URI, we can meet the requirement, but we would need to create a lot of aggregate URIs. It would not be convenient for customers to use many aggregate URIs and then combine the responses.  Is there a way to process a chunk of properties in the route handler and voluntarily release the context so bmcweb can process the next URIs in the queue? Any other trick that can work here? 


> Here the idea is to develop a new application server to serve the URIs which have strong latency requirements and route the rest of the URIs to bmcweb.
>> This is the part I don't understand;  If the forwarding calls in this new server are blocking to bmcweb, what's the point of adding it?
The backend for the thermal metrics and the stats/counters metrics is different, so we would not be blocked on the same service. 
With a reverse proxy, our idea is to forward the thermal metric URIs to another backend application server (a new bmcweb instance or a similar lightweight server) and all other URIs to the existing bmcweb. 

I will share further results from our internal tests as we try out different things. 

Thanks 
Rohit PAI 




* Re: Prioritizing URIs with tight performance requirement in openBmc with bmcweb
  2023-05-26 15:16   ` Rohit Pai
@ 2023-05-26 16:29     ` Ed Tanous
  0 siblings, 0 replies; 9+ messages in thread
From: Ed Tanous @ 2023-05-26 16:29 UTC (permalink / raw)
  To: Rohit Pai; +Cc: openbmc

On Fri, May 26, 2023 at 8:16 AM Rohit Pai <ropai@nvidia.com> wrote:
>
> Hello Ed,
>
> Thanks for your response and feedback. Please find my answers to your questions below.

One additional test, can you enable handler tracking and provide the result?
https://www.boost.org/doc/libs/1_82_0/doc/html/boost_asio/overview/core/handler_tracking.html

That will give a better idea of what specific callback is stalling,
and for how long.
One thing I've considered is just putting a hard cap of, say, 100ms on
the individual handlers, where if any handler takes longer than that,
print to the console.  That would allow us to much more easily find
the flows that are taking too long in practice.  I would expect most
individual context handlers to complete in < 10 ms, unless they're
processing too much data.
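
Note handler tracking is a compile-time switch, so it means a rebuild; something like this (or equivalently add the define to the compiler args in meson):

// Must be defined before any asio header is pulled in; asio then logs
// every handler's creation, invocation start, and completion to stderr
// with timestamps, which makes long-running handlers easy to spot.
#define BOOST_ASIO_ENABLE_HANDLER_TRACKING
#include <boost/asio.hpp>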

>
> > > What other constraints are here?  We're talking about a TCP protocol, running on a network, on a multi-process CPU.  Are these hard realtime requirements?  It's unlikely you're going to get hard realtime guarantees from Redfish.
> No, these are not hard real-time requirements. The latency can go beyond 500ms in some instances, but we want the outliers (samples that take more than 500ms of turnaround time) to be less than one percent of the total samples.
> We also want to limit the max latency of the power/thermal metric URI to under 1s. The fabric BMC periodically polls this URI from our platform and implements its PID control loop based on the response, so it is critical for us to limit the max latency of this URI.
>
> >>     A      B                  C       D
> >> TCP ─► TLS ─► HTTP Connection ─► DBus ─► Daemon
> >> Which location are you seeing queuing problems?
> The time spent in route handler after getting the response from the backend daemon is affecting the next URI which is waiting in the queue.

To make sure we're clear, the route handler is the part JUST calling
DBus and turning it into json objects.  The json object serialization
happens outside the router call, so it isn't taking significant time
in your tests?

> From our tests we see that the backend daemon responds on average within 200ms. Since this is async, it has no impact.
> The response received from the dbus backend has around 2500 properties to process. Processing this in the route handler and preparing the response takes an average of 200ms (max 400ms, with several instances close to 300ms). The average CPU consumption during the test is 50%, and bmcweb's average consumption is between 12 and 15%.

This is not so bad;  I'm assuming you're not running encryption?

>
>
> >> What is the ballpark for how big "huge amount" of data would be?  What processing is being done?
> The JSON in the HTTP response is around 400KB.

First of all, let's do the easy things to reduce the size.  Start by
making the json not pretty print (change the nlohmann::json::dump()
call in http_connection.hpp to do that), or switch your clients to use
CBOR encoding, which was added a few months ago, to reduce the payload
size being sent on the wire.
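
For the pretty-print part, it's just the indent argument to dump() (illustrative only; the actual call site in http_connection.hpp may look different):

#include <nlohmann/json.hpp>

#include <string>

inline std::string serialize(const nlohmann::json& j, bool pretty)
{
    // dump() with no indent emits compact single-line JSON; dump(4)
    // pretty-prints and meaningfully inflates a 400KB payload on the wire.
    return pretty ? j.dump(4) : j.dump();
}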

> The route handler logic makes the GetManagedObjects call to the backend service and processes the response which has around 2500 properties to loop through.

Can you please change this to a Properties GetAll call and see if that
improves your consistency?  We never really intended messages to get
THAT big, so the fact that you're seeing some blocking isn't that
surprising.  Also keep in mind, the data structures we use for the
GetManagedObjects call haven't really been optimized, and create a
bunch of intermediates.  If you profile this, you could likely get the
DBus message -> nlohmann json processing time down by an order of
magnitude.
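
The shape of that call would be roughly the following (sketch; the service name, object path, and interface here are placeholders, not your real ones):

#include <sdbusplus/asio/connection.hpp>

#include <string>
#include <variant>
#include <vector>

using PropertyMap =
    std::vector<std::pair<std::string, std::variant<double, std::string>>>;

inline void getAllProperties(sdbusplus::asio::connection& bus)
{
    bus.async_method_call(
        [](const boost::system::error_code& ec, const PropertyMap& props) {
            if (ec)
            {
                return; // real code would fill in an error response
            }
            // Only the properties of the one interface we asked for come
            // back, instead of every object the daemon exports.
        },
        "xyz.openbmc_project.ExampleDaemon",             // placeholder service
        "/xyz/openbmc_project/sensors/temperature/cpu0", // placeholder path
        "org.freedesktop.DBus.Properties", "GetAll",
        "xyz.openbmc_project.Sensor.Value");
}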

> There is no specific business logic here which is time consuming but it's the number of properties which seem to have significant impact.
>
> >> One thing I've considered before is switching bmcweb over to boost::json, which can do incremental chunked parsing, unlike nlohmann, which would let us unblock the flows as each processes the data.
> Do you have any pointers for this? Where exactly would we unblock the flow: in the route handler after processing a chunk of properties from the backend, or somewhere else?

boost/json is in the package and has better documentation than I have.
In terms of unblocking, we could actually write out values
incrementally as the dbus responses are returned, so we reduce the
memory required, and in theory reduce the amount of time we block on
any one handler.  Even if we didn't do that fully, we could, say,
process 4kB worth of data, then return control to the io_context
through boost::asio::dispatch so that other traffic could be
processed.
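
For the writing side, boost::json's serializer can already emit a buffer at a time; roughly (sketch; serializer is the real API, the rest is illustrative):

#include <boost/json.hpp>

#include <string>

// Sketch: emit the response a buffer at a time instead of building one
// big string; each read() below is a natural point to post() back to the
// io_context and let other handlers run.
inline std::string serializeInChunks(const boost::json::value& jv)
{
    boost::json::serializer sr;
    sr.reset(&jv);
    std::string out;
    char buf[4096];
    while (!sr.done())
    {
        auto chunk = sr.read(buf, sizeof(buf));
        out.append(chunk.data(), chunk.size()); // or write to the socket
    }
    return out;
}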

>
> >> Can you share your test?  Is your test using multiple connections to ensure that the thermal metric is being pulled from a different connection than the aggregation URI?
> In the test we have two clients connected to our platform.
> One client does power, thermal metric URI polling for every 500ms. We have single aggregate URI for this which combine all power, thermal sensors. We have close to 100 sensors. This is the URI where we have strict performance requirement.
> Other client periodically polls stats and counters aggregate URI for every 5 seconds. This is the URI which has around 400KB of JSON response and has 2500 dbus properties to process. We are not very particular about the latency of the
> stats and counters URI. In the test result we see that power, thermal metric URI has max latency time of 1.2 seconds (min is 2ms and avg is 9ms) and around 3.5% outliers (samples took more than 500ms). We can see that whenever bmcweb code is busy preparing the response for the stats and counter URI and there is request for thermal metric URI then latency of the thermal metric URIs is affected.
> If we limit the number of properties in the stats and counters URI, then we can meet the requirement, but we need to create lot of aggregate URIs. It would not be convenient for the customers to use many aggregate URIs and then combine the response.  Is there way to process chunk of properties in the route handler and voluntarily release the context for bmcweb to process the next URIs in the queue ? any other trick which can work here ?

Can you share your test code?  Which URIs are you hitting that are
calling GetManagedObjects?

FWIW, as an overall theme, I do wonder if the speed gained by going to
GetManagedObjects is worth it in the long run.  It seems to
significantly affect stability, because even objects added that the
route doesn't care about have to be processed, and as systems are
getting larger, the GetManagedObjects responses are getting larger,
for no change that the user sees.  Said another way, we're throwing
away significantly more data.

>
>
> > Here the idea is to develop a new application server to serve the URIs which have strong latency requirements and route the rest of the URIs to bmcweb.
> >> This is the part I don't understand;  If the forwarding calls in this new server are blocking to bmcweb, what's the point of adding it?
> The backend for the thermal metrics and stats/counters metrics is different so we would not be blocked on the same service.

I find it unlikely that we'll be able to all agree on what is "high"
priority (some people see any http call as low priority), so unless
the priority for a given route is easily configurable,
I'm not sure this kind of architecture would work.  My preference
would be to:
1. Optimize the callbacks that are taking too much time, and make them
call boost::asio::dispatch during long loops.
2. Use threads to ensure there are more resources available that don't
block (this only works up to N connections, whatever N is for the
thread count).


Thanks for the discussion;  Let me know what you find.


> With a reverse proxy our idea is to forward thermal metrics to another backend application server (new bmcweb server or a similar lightweight server instance) and all other URIs to the existing bmcweb.
>
> I will share any further results from our internal tests as we tryout different things.
>
> Thanks
> Rohit PAI
>
>


* Re: Prioritizing URIs with tight performance requirement in openBmc with bmcweb
  2023-05-24 17:56   ` Ed Tanous
@ 2023-06-01 18:33     ` Ed Tanous
  2023-06-03  8:49       ` Rohit Pai
  0 siblings, 1 reply; 9+ messages in thread
From: Ed Tanous @ 2023-06-01 18:33 UTC (permalink / raw)
  To: Rohit Pai; +Cc: openbmc

On Wed, May 24, 2023 at 10:56 AM Ed Tanous <edtanous@google.com> wrote:
>
> On Wed, May 24, 2023 at 9:26 AM Ed Tanous <edtanous@google.com> wrote:
> >
> > There's likely a few other minor things that would need fixed, but the
> > above is the general gist.
>
> I spent an hour or so and put together a simple POC of multithreading
> bmcweb.  It's not threadsafe for all the global structures (although I
> did one as an example), and has a lot of safety issues we'll need to
> work through, but will give us a hint about whether multi-threading
> bmcweb will solve your performance problem:
> https://gerrit.openbmc.org/c/openbmc/bmcweb/+/63710
>
> PTAL

Rohit, were you able to try this?


* RE: Prioritizing URIs with tight performance requirement in openBmc with bmcweb
  2023-06-01 18:33     ` Ed Tanous
@ 2023-06-03  8:49       ` Rohit Pai
  2023-06-08 17:18         ` Ed Tanous
  0 siblings, 1 reply; 9+ messages in thread
From: Rohit Pai @ 2023-06-03  8:49 UTC (permalink / raw)
  To: Ed Tanous; +Cc: openbmc

Hello Ed, 

We have been analyzing the time spent in the route handler code to get some measurements with different approaches. 
At least we know from our tests that the majority of the time is spent in route handlers and not in JSON serialization, which happens outside. 

To recap our test, we have two HTTP clients polling the thermal metric and the stats/counters metric URIs simultaneously from the BMC. 
The thermal metric URI has around 100 sensors and a tight latency requirement of 500ms. 
The stats/counters metric URI has around 2500 properties to fetch from the backend, which uses the GetManagedObjects API. 
The time analysis was done on the latency of the stats/counters URI, since this impacts the latency of the thermal metric URI given bmcweb's current single-threaded nature. 

Method 1 - ObjectManager call to the backend service; the bmcweb handler code processes the response and prepares the required JSON objects. 
a. Backend dbus call turnaround time                       -  584 ms 
b. Logic in bmcweb route handler code to prepare response  -  365 ms 
c. Total URI latency                                       - 1019 ms

Method 2 - Backend populates all the needed properties in a single aggregate property. 
a. Backend dbus call turnaround time                       -  161 ms 
b. Logic in bmcweb route handler code to prepare response  -   71 ms 
c. Total URI latency                                       -  291 ms

Method 3 - bmcweb reads all the properties from a file fd. The goal here is to eliminate the latency and load that come from using dbus as an IPC for large payloads. 
a. fd read call in bmcweb                                  -   64 ms 
b. JSON object population from the read file contents      -   96 ms 
c. Total URI latency                                       -  254 ms 
The file contents were in JSON format. If we can replace this with an efficient data structure that can be used with fd passing, then I think we can further optimize point b. 
Optimizing the CPU-bound logic in the handler code would certainly help the latency of the other requests pending in the queue. 

I will try the multi-threaded solution you put up in the coming days and share the results. 

Thanks 
Rohit PAI 



* Re: Prioritizing URIs with tight performance requirement in openBmc with bmcweb
  2023-06-03  8:49       ` Rohit Pai
@ 2023-06-08 17:18         ` Ed Tanous
  2023-06-23 12:19           ` Rohit Pai
  0 siblings, 1 reply; 9+ messages in thread
From: Ed Tanous @ 2023-06-08 17:18 UTC (permalink / raw)
  To: Rohit Pai; +Cc: openbmc

On Sat, Jun 3, 2023 at 1:49 AM Rohit Pai <ropai@nvidia.com> wrote:
>
> Hello Ed,

The below is all really great data

>
> Thermal metric URI has around 100 sensors and has tight latency perf requirement of 500ms.
> Stats/counter metric URI has around 2500 properties to fetch from the backend which uses the GetManagedObjects API.
> Time analysis was done on the latency measurement of stats/counter URI as this impacts the latency of thermal metric URI with the current bmcweb single threaded nature.

What two queries are you testing with?

>
>
>
> Method 1 - Object Manger call to the backend service, bmcweb handler code processes the response and prepares the required JSON objects.
> a. Backend dbus call turnaround time                                              - 584 ms

This is quite high.  Have you looked at reducing this?  This would
imply that you're doing blocking calls in your backend daemon.

> b. Logic in bmcweb route handler code to prepare response      - 365 ms

This could almost certainly be reduced with some targeted things.  You
didn't answer me on whether you're using TLS in this example, so from
your numbers I'm going to assume you're not.  I would've expected
crypto to be a significant part of your profile.

> c. Total URI latency                                                                               - 1019 ms

a + b != c.  Is the rest the time spent writing to the socket?  What
is the extra time?

>
> Method 2 - Backend populates all the needed properties in a single aggregate property.
> a. Backend dbus call turnaround time                                              - 161 ms

This is still higher than I would like to see, but in the realm of
what I would expect.

> b. Logic in bmcweb route handler code to prepare response      - 71   ms

I would've expected to see this in single digit ms for a single
property.  Can you profile here and see what's taking so long?

> c. Total URI latency                                                                               - 291 ms
>
> Method 3 - Bmcweb reads all the properties from a file fd. Here goal is to eliminate latency and load coming by using dbus as an IPC for large payloads.
> a. fd read call in bmcweb                                                                     - 64 ms

This is roughly equivalent to the dbus call, so if we figure out where
the bottleneck is in method 1B above, we could probably get
this comparable.

> b. JSON object population from the read file contents             - 96 ms

This seems really high.

> c. Total URI latency                                                                                - 254 ms
> The file contents were in JSON format. If we can replace this with efficient data structure which can be used with fd passing, then I think we can further optimize point b.

In Method 3 you've essentially invented a new internal OpenBMC API.  I
would love to foster discussions of how to handle that, but we need to
treat it holistically in the system, and understand how:
1. The schemas will be managed
2. Concurrency will be managed
3. Blocking will be managed (presumably you did a blocking filesystem
read to get the data)

I'm happy to have those discussions, and the data you have above is
interesting, but any sort of change would require much larger project
buy-in.

> Optimization around CPU bound logic in handler code would certainly help the latency of the other requests pending in the queue.

Is it CPU bound or memory-bandwidth bound?  Most of the time I've seen
the latter.  How did you collect the measurements on cpu versus IO
versus memory bound?

>
> I will try the multi-threaded solution put by you in the coming days and share the results.
>

Sounds good.  Thanks for the input.


* RE: Prioritizing URIs with tight performance requirement in openBmc with bmcweb
  2023-06-08 17:18         ` Ed Tanous
@ 2023-06-23 12:19           ` Rohit Pai
  0 siblings, 0 replies; 9+ messages in thread
From: Rohit Pai @ 2023-06-23 12:19 UTC (permalink / raw)
  To: 'Ed Tanous'; +Cc: openbmc

Hello Ed, 

> What two queries are you testing with?
These are the MetricReport URIs: 
http://${BMC}/redfish/v1/TelemetryService/MetricReports/
There is no crypto involved; I am using HTTP only.  

> a. Backend dbus call turnaround time                                              - 584 ms
> This is quite high.  Have you looked at reducing this?  This would imply that you're doing blocking calls in your backend daemon.
There is no simple way to reduce this because we have around 2000 properties spread across different objects, so the dbus call uses GetManagedObjects, which triggers several internal get handlers. 
One solution we tried is to put all the needed info into a single aggregate property and fetch it via a single get-property call. Even with that approach the dbus round trip takes 100+ ms for 2000 sensors' worth of data. 
The time taken by bmcweb to prepare the response was mainly impacted by other services in the system that were consuming considerable CPU bandwidth. 

> FD passing technique between backend and bmcweb 
> I'm happy to have those discussions, and the data you have above is interesting, but any sort of change would require much larger project buy-in.
What we have learnt from our experiment is that there is no way to make this solution robust without an explicit locking mechanism. 
In our POC we had a backend service that writes 500KB of data to a file every 500ms, and bmcweb would read this file fd in the route handler. When we ran this code without any explicit locking or synchronization, we ran into data-consistency issues. It is hard to ensure that the reader always gets complete, correct data, even when this is done in a block read/write manner. 
If we add reader/writer locks, then I expect the performance would come close to dbus IPC. 
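
The reader side we are thinking of would look roughly like this (sketch; it assumes the writer takes LOCK_EX around each rewrite, and the read is still a blocking call, which has its own cost inside bmcweb's event loop):

#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

#include <vector>

// Sketch: take a shared lock so a read never overlaps the writer's
// LOCK_EX, which was the source of the torn/inconsistent data we saw.
inline std::vector<char> readSnapshot(const char* path)
{
    std::vector<char> buf;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
    {
        return buf;
    }
    if (flock(fd, LOCK_SH) == 0) // blocks while the writer holds LOCK_EX
    {
        char chunk[4096];
        ssize_t n = 0;
        while ((n = read(fd, chunk, sizeof(chunk))) > 0)
        {
            buf.insert(buf.end(), chunk, chunk + n);
        }
        flock(fd, LOCK_UN);
    }
    close(fd);
    return buf;
}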

> Optimization around CPU bound logic in handler code would certainly help the latency of the other requests pending in the queue.
> Is it CPU bound, or Memory bandwidth bound?  Most of the time I've seen the latter.  How did you collect the measurements on cpu versus IO versus memory bound?
All time tracking is done via explicit logging in bmcweb, so we can't really tell whether it's CPU or memory bandwidth, but we do know whether the CPU is waiting on an IO job or it's just compute logic that is taking the time. 


 > I will try the multi-threaded solution you put up in the coming days and share the results.
When I tried the patch, I did not see any improvement. I also saw that bmcweb was still single-threaded; there was only one entry under /proc/<bmcweb-pid>/task.
In the patch we only pass a thread-count parameter to the boost io_context. As per my reading, this does not create the threads; it only sets the concurrency hint for the io_context. 
I added the below piece of code to webserver_main.cpp to create the needed threads and call io.run() on each of them. 

// ...
void runIOService()
{
    // Run the shared io_context on this worker thread
    boost::asio::io_context& io = crow::connections::getIoContext();
    io.run();
}
// ...
    // Create and launch the worker threads
    std::vector<std::thread> threads;
    for (unsigned int i = 0; i < 4; ++i)
    {
        threads.emplace_back(runIOService);
    }

    // Wait for all threads to finish
    for (auto& thread : threads)
    {
        thread.join();
    }
// ...

This seems to create the threads we want, and at the unit-test level it worked, but when we ran a stress test looping a few URIs across multiple concurrent clients it breaks. 
There is no crash dump from bmcweb; it just appears to hang and goes into a non-responsive state. We need to dig deeper to see whether some sort of deadlock is happening. 
Let me know if you have already found a fix from your testing or, in general, have any immediate thoughts on where it might be coming from. 
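
One thing we plan to try next, per your step 2 from earlier in the thread, is serializing each connection's handlers on a strand; roughly this shape (illustrative names, not actual bmcweb types):

#include <boost/asio/bind_executor.hpp>
#include <boost/asio/io_context.hpp>
#include <boost/asio/strand.hpp>

// Sketch: with plain io.run() on four threads, handlers for one
// connection (and for the shared sdbusplus connection) can run
// concurrently, which would explain a silent hang or corruption.
// A per-connection strand prevents that.
struct ConnectionContext
{
    explicit ConnectionContext(boost::asio::io_context& io) :
        strand(boost::asio::make_strand(io))
    {}

    // Each async op's completion handler gets wrapped like:
    //   boost::asio::bind_executor(strand, std::move(handler))
    boost::asio::strand<boost::asio::io_context::executor_type> strand;
};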

Thanks 
Rohit 




