* [Qemu-devel] Hight Processor time of Socket communciation
@ 2017-04-18 16:19 Jiahuan Zhang
  2017-04-18 16:26 ` Peter Maydell
  0 siblings, 1 reply; 10+ messages in thread
From: Jiahuan Zhang @ 2017-04-18 16:19 UTC (permalink / raw)
  To: QEMU Developers

Dear QEMU developers,
I am measuring the processor time for guest-host communication via a socket.
The guest app writes a 5M image to a serial device.
The serial device is redirected to the socket on the command line.
The host app receives the data via the socket until the peer closes the
connection.
Please find attached the processor time graph generated by Windows
Performance Monitor.
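
The host-side behaviour described above (receive until the peer closes the connection) can be sketched roughly as follows. This is a minimal POSIX illustration and not the actual host app, which was not posted and runs on Windows (Winsock's recv() also returns 0 on a graceful close). The function name drain_until_close is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read from a connected stream socket until the peer closes it.
 * Returns the number of bytes received, or -1 on a receive error. */
static long long drain_until_close(int fd)
{
    char buf[4096];
    long long total = 0;
    ssize_t n;

    assert(fd >= 0);
    while ((n = recv(fd, buf, sizeof buf, 0)) > 0)
        total += n;               /* data would be processed here */
    return n == 0 ? total : -1;   /* 0 means an orderly close by the peer */
}
```

The loop blocks in recv() while no data is pending, so the receiver itself should add little processor time; the cost being measured comes from the sending side.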

The graph shows that the processor time is almost 100% while communicating.
This surprised me! My expectation was 1%.

I wonder if this is the expected performance for QEMU socket communication? Or
is this high processor time caused by the serial device? If so, is there any
optimization I can do?

Here is the QEMU command line I used.
$ qemu/build/arm-softmmu/qemu-system-arm.exe -M vexpress-a9 -kernel
zImage_vexpress_4-10 -dtb vexpress-v2p-ca9.dtb -initrd rootfs.img.gz
-append "console=ttyAMA0 root=/dev/ram rdinit=linuxrc" -chardev
socket,host=localhost,port=27015,server,nowait,id=char1 -serial
telnet:localhost:5555,server,nowait -serial stdio -serial stdio -serial
chardev:char1 -monitor telnet:localhost:4444,server,nowait -sd test.img
-nographic

Please correct me if something is wrong.

Best regards,

Huan

[-- Attachment #2: socket_guest.png --]
[-- Type: image/png, Size: 74842 bytes --]


* Re: [Qemu-devel] Hight Processor time of Socket communciation
  2017-04-18 16:19 [Qemu-devel] Hight Processor time of Socket communciation Jiahuan Zhang
@ 2017-04-18 16:26 ` Peter Maydell
  2017-04-19  8:56   ` Jiahuan Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Maydell @ 2017-04-18 16:26 UTC (permalink / raw)
  To: Jiahuan Zhang; +Cc: QEMU Developers

On 18 April 2017 at 17:19, Jiahuan Zhang <jiahuanzhang90@gmail.com> wrote:
> Dear QEMU developers,
> I am measuring the processor time for guest-host communication via socket.
> The guest app is to write a 5M image to a serial device.
> The serial device is redirected to the socket in the command line.
> The host app is to receive the data via socket until the peer closes the
> connection.
> Please find in the attachment the Processor time graph generated by Windows
> Performance Monitor.
>
> The graph shows the processor time is almost 100% while communicating.
> Surprising me! My expectation is 1%.
>
> I wonder if this is the right performance for QEMU socket communication? Or
> this high processor time is caused by the serial device? If so, any
> optimization I can do?

The serial device on the vexpress-a9 model is a PL011, which
is a fairly simple UART which all data must be written to
byte-at-a-time. This is never going to be fast, because we
have to execute a lot of guest code to send the data through
this byte-at-a-time bottleneck, and since you're running
a purely emulated QEMU, executing guest code means doing
a lot of CPU operations.

You will likely get better throughput if you use the 'virt' board
where you can use the virtio-serial device which can send
data more efficiently.

thanks
-- PMM


* Re: [Qemu-devel] Hight Processor time of Socket communciation
  2017-04-18 16:26 ` Peter Maydell
@ 2017-04-19  8:56   ` Jiahuan Zhang
  2017-04-19  9:15     ` Peter Maydell
  0 siblings, 1 reply; 10+ messages in thread
From: Jiahuan Zhang @ 2017-04-19  8:56 UTC (permalink / raw)
  To: Peter Maydell; +Cc: QEMU Developers

On 18 April 2017 at 18:26, Peter Maydell <peter.maydell@linaro.org> wrote:

> On 18 April 2017 at 17:19, Jiahuan Zhang <jiahuanzhang90@gmail.com> wrote:
> > Dear QEMU developers,
> > I am measuring the processor time for guest-host communication via
> socket.
> > The guest app is to write a 5M image to a serial device.
> > The serial device is redirected to the socket in the command line.
> > The host app is to receive the data via socket until the peer closes the
> > connection.
> > Please find in the attachment the Processor time graph generated by
> Windows
> > Performance Monitor.
> >
> > The graph shows the processor time is almost 100% while communicating.
> > Surprising me! My expectation is 1%.
> >
> > I wonder if this is the right performance for QEMU socket communication?
> Or
> > this high processor time is caused by the serial device? If so, any
> > optimization I can do?
>
> The serial device on the vexpress-a9 model is a PL011, which
> is a fairly simple UART which all data must be written to
> byte-at-a-time. This is never going to be fast, because we
> have to execute a lot of guest code to send the data through
> this byte-at-a-time bottleneck, and since you're running
> a purely emulated QEMU, executing guest code means doing
> a lot of CPU operations.
>

Hi Peter,
Do you mean that it is reasonable for QEMU emulation to consume high CPU time
when doing host-guest interaction, since the interaction calls a lot of QEMU
code in the background?

The situation I met is this:
1. After the socket connection is made and I enter the guest kernel console,
QEMU's processor time is very low, although some callback functions are
polling.
2. When I start the guest app to send data to the serial device, which is
redirected to the socket, the processor time becomes very high.
3. When the data transfer is done, the processor time drops back to low
again.

Since my guest app is rather simple and contains no while() loop, according
to your words,
can I conclude that the high processor time is caused by the callbacks for
guest-to-host data transfer?


> You will likely get better throughput if you use the 'virt' board
> where you can use the virtio-serial device which can send
> data more efficiently.
>

Here, can I understand your statement in this way:
a transmit buffer in the serial device for guest-to-host data transfer
may reduce the processor time and, in turn, increase the throughput,
because the transmit buffer enables multi-byte data to be transferred?
Then, taking the char-socket as an example, tcp_chr_write will be
called fewer times
when the "len" variable is larger than 1.

Please correct me if my logic is wrong. Thanks.
Regards,
huan



* Re: [Qemu-devel] Hight Processor time of Socket communciation
  2017-04-19  8:56   ` Jiahuan Zhang
@ 2017-04-19  9:15     ` Peter Maydell
  2017-04-19  9:25       ` Jiahuan Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Maydell @ 2017-04-19  9:15 UTC (permalink / raw)
  To: Jiahuan Zhang; +Cc: QEMU Developers

On 19 April 2017 at 09:56, Jiahuan Zhang <jiahuanzhang90@gmail.com> wrote:
> Do you mean that it is reasonable for QEMU emulation consumes high CPU time
> when doing host-guest interaction, since the interaction calls many QEMU
> codes in the background?

> Since my guest app is rather simple and no while() is included, according to
> your words,
> can I conclude that the high processor time is cause by the callbacks for
> guest to host data transfer?

What is happening is that the guest kernel's serial driver
has a loop that (simplified) looks like this:

  do {
      if (pl011_read(REG_FR) & FR_TXFF)
          break; /* fifo full, try again later */
      pl011_write(buffer[x], REG_DR);  /* send one byte */
      x++;
  } while (x != len);

This is a lot of guest CPU instructions (and two callouts
to QEMU's device emulation) for every single byte.
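
To make the per-byte cost concrete, here is a small host-side model of that loop. It is only a sketch: pl011_read() and pl011_write() are hypothetical stand-ins that count register accesses instead of touching hardware, and the model assumes the FIFO is never full. The register offsets and the TXFF bit are the PL011's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_DR   0x00u        /* PL011 data register offset */
#define REG_FR   0x18u        /* PL011 flag register offset */
#define FR_TXFF  (1u << 5)    /* transmit FIFO full */

/* Hypothetical stand-in for QEMU's device emulation: it only counts
 * how often the "guest" touches a device register. */
static uint64_t device_accesses;

static uint32_t pl011_read(uint32_t reg)
{
    (void)reg;
    device_accesses++;
    return 0;                 /* model a FIFO that is never full */
}

static void pl011_write(uint8_t value, uint32_t reg)
{
    (void)value;
    (void)reg;
    device_accesses++;
}

/* Same shape as the simplified driver loop; returns bytes sent. */
static size_t send_bytes(const uint8_t *buffer, size_t len)
{
    size_t x = 0;

    assert(buffer != NULL);
    while (x != len) {
        if (pl011_read(REG_FR) & FR_TXFF)
            break;            /* fifo full, try again later */
        pl011_write(buffer[x], REG_DR);   /* send one byte */
        x++;
    }
    return x;
}
```

In this model every byte costs two device accesses, so a 5 MiB image (5 * 1024 * 1024 bytes) would need 10,485,760 of them, each one a trip out of emulated guest code into the device model.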

>> You will likely get better throughput if you use the 'virt' board
>> where you can use the virtio-serial device which can send
>> data more efficiently.

> Here, can I understand your statement in this way,
> a transmit buffer in the serial device for guest to host data transfer
> may reduce the processor time, and in turn, increase the throughput?

The reason virtio-serial is faster is because the guest
kernel driver can essentially tell QEMU
 "the data is in guest memory at address X length L"
and then QEMU takes all that data at once. This is much
more efficient.
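
As a rough contrast with the byte-at-a-time case, the "address X, length L" idea can be modelled like this. It is a deliberately simplified sketch and not the real virtio ABI: struct xfer_request, device_take() and the callout counter are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily simplified request: NOT the virtio wire format. */
struct xfer_request {
    const uint8_t *addr;   /* "the data is in guest memory at address X" */
    size_t len;            /* "length L" */
};

static uint64_t device_callouts;   /* guest-to-device transitions */

/* Model of the device side: one callout moves the whole buffer.
 * Returns the number of bytes copied into host_buf. */
static size_t device_take(const struct xfer_request *req,
                          uint8_t *host_buf, size_t host_cap)
{
    size_t n;

    assert(req != NULL && host_buf != NULL);
    device_callouts++;
    n = req->len < host_cap ? req->len : host_cap;
    memcpy(host_buf, req->addr, n);
    return n;
}
```

Here the number of callouts depends on the number of buffers handed over, not on the number of bytes in them.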

thanks
-- PMM


* Re: [Qemu-devel] Hight Processor time of Socket communciation
  2017-04-19  9:15     ` Peter Maydell
@ 2017-04-19  9:25       ` Jiahuan Zhang
  2017-04-19  9:55         ` Peter Maydell
  0 siblings, 1 reply; 10+ messages in thread
From: Jiahuan Zhang @ 2017-04-19  9:25 UTC (permalink / raw)
  To: Peter Maydell; +Cc: QEMU Developers

On 19 April 2017 at 11:15, Peter Maydell <peter.maydell@linaro.org> wrote:

> On 19 April 2017 at 09:56, Jiahuan Zhang <jiahuanzhang90@gmail.com> wrote:
> > Do you mean that it is reasonable for QEMU emulation consumes high CPU
> time
> > when doing host-guest interaction, since the interaction calls many QEMU
> > codes in the background?
>
> > Since my guest app is rather simple and no while() is included,
> according to
> > your words,
> > can I conclude that the high processor time is cause by the callbacks for
> > guest to host data transfer?
>
> What is happening is that the guest kernel's serial driver
> has a loop that (simplified) looks like this:
>
>   do {
>       if (pl011_read(REG_FR) & FR_TXFF)
>           break; /* fifo full, try again later */
>       pl011_write(buffer[x], REG_DR);  /* send one byte */
>       x++;
>   } while (x != len);
>
> This is a lot of guest CPU instructions (and two callouts
> to QEMU's device emulation) for every single byte.
>
Hi, no, I am not using any kernel driver and I am only testing the
guest-to-host data transfer.
I need kernel-transparent guest-host communication.
The code looks like this.
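#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* BUF_SIZE, the buffer s and the UART1 data-register pointer are
 * defined elsewhere in the app and are not part of this excerpt. */
void write_to_uart(char* out, uint32_t writeSize);

int main() {
 const char *filename = "image_set/Snake_River_(5mb).jpg";
 /* get the file size */
 struct stat buf;
 if (stat(filename, &buf) == 0)
  printf("image size = %ld \n", (long)buf.st_size);
 else
  printf("stat() failed");
 /* open the file; open() returns -1 on failure, not 0 */
 int fd = open(filename, O_RDONLY);
 if (fd < 0) {
  printf("could not open file.\n");
  return 0;
 }
 /* read file into s, one BUF_SIZE chunk at a time */
 ssize_t ret_in;
 while ((ret_in = read(fd, &s[0], BUF_SIZE)) > 0) {
  write_to_uart(s, (uint32_t)ret_in);
 }
 close(fd);
 return 0;
}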

/*
 * write_to_uart(): write data to serial port
 */
void write_to_uart(char* out, uint32_t writeSize){
 int i;
 for(i=0; i<writeSize; i++){
  *UART1 =(unsigned char)(*(out+i));
 }
}

regards,
Huan



* Re: [Qemu-devel] Hight Processor time of Socket communciation
  2017-04-19  9:25       ` Jiahuan Zhang
@ 2017-04-19  9:55         ` Peter Maydell
  2017-04-19 10:04           ` Jiahuan Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Maydell @ 2017-04-19  9:55 UTC (permalink / raw)
  To: Jiahuan Zhang; +Cc: QEMU Developers

On 19 April 2017 at 10:25, Jiahuan Zhang <jiahuanzhang90@gmail.com> wrote:
> On 19 April 2017 at 11:15, Peter Maydell <peter.maydell@linaro.org> wrote:
>> What is happening is that the guest kernel's serial driver
>> has a loop that (simplified) looks like this:
>>
>>   do {
>>       if (pl011_read(REG_FR) & FR_TXFF)
>>           break; /* fifo full, try again later */
>>       pl011_write(buffer[x], REG_DR);  /* send one byte */
>>       x++;
>>   } while (x != len);
>>
>> This is a lot of guest CPU instructions (and two callouts
>> to QEMU's device emulation) for every single byte.
>>
> Hi, no, I am not using any kernel driver and I only test the guest to host
> data transfer.

OK, then the equivalent loop is this one:

> /*
>  * write_to_uart(): write data to serial port
>  */
> void write_to_uart(char* out, uint32_t writeSize){
>  int i;
>  for(i=0; i<writeSize; i++){
>   *UART1 =(unsigned char)(*(out+i));
>  }
> }

except that your code is broken because it's not
checking that the FIFO is ready to receive the character
so it will drop data sometimes.

The point is the same -- you're feeding the data to
the UART byte-at-a-time.
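
For illustration, a version of that loop which does check the FIFO might look like the sketch below. The register offsets and the TXFF bit are the PL011's; the function name and the 'base' parameter (the address at which the UART registers are mapped) are hypothetical, since the posted code does not show how UART1 is defined.

```c
#include <assert.h>
#include <stdint.h>

#define UARTDR      0x00u       /* data register offset */
#define UARTFR      0x18u       /* flag register offset */
#define UARTFR_TXFF (1u << 5)   /* transmit FIFO full */

/* 'base' is a hypothetical parameter: the mapped base of the UART. */
static void write_to_uart_checked(volatile uint32_t *base,
                                  const char *out, uint32_t writeSize)
{
    uint32_t i;

    assert(base != 0 && out != 0);
    for (i = 0; i < writeSize; i++) {
        /* busy-wait until the transmit FIFO has room */
        while (base[UARTFR / 4] & UARTFR_TXFF)
            ;
        base[UARTDR / 4] = (unsigned char)out[i];
    }
}
```

This avoids dropped data but adds a register read per byte, so it is still byte-at-a-time and will not reduce the processor time.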

thanks
-- PMM


* Re: [Qemu-devel] Hight Processor time of Socket communciation
  2017-04-19  9:55         ` Peter Maydell
@ 2017-04-19 10:04           ` Jiahuan Zhang
  2017-04-19 10:09             ` Peter Maydell
  0 siblings, 1 reply; 10+ messages in thread
From: Jiahuan Zhang @ 2017-04-19 10:04 UTC (permalink / raw)
  To: Peter Maydell; +Cc: QEMU Developers

On 19 April 2017 at 11:55, Peter Maydell <peter.maydell@linaro.org> wrote:

> On 19 April 2017 at 10:25, Jiahuan Zhang <jiahuanzhang90@gmail.com> wrote:
> > On 19 April 2017 at 11:15, Peter Maydell <peter.maydell@linaro.org>
> wrote:
> >> What is happening is that the guest kernel's serial driver
> >> has a loop that (simplified) looks like this:
> >>
> >>   do {
> >>       if (pl011_read(REG_FR) & FR_TXFF)
> >>           break; /* fifo full, try again later */
> >>       pl011_write(buffer[x], REG_DR);  /* send one byte */
> >>       x++;
> >>   } while (x != len);
> >>
> >> This is a lot of guest CPU instructions (and two callouts
> >> to QEMU's device emulation) for every single byte.
> >>
> > Hi, no, I am not using any kernel driver and I only test the guest to
> host
> > data transfer.
>
> OK, then the equivalent loop is this one:
>
> > /*
> >  * write_to_uart(): write data to serial port
> >  */
> > void write_to_uart(char* out, uint32_t writeSize){
> >  int i;
> >  for(i=0; i<writeSize; i++){
> >   *UART1 =(unsigned char)(*(out+i));
> >  }
> > }
>
> except that your code is broken because it's not
> checking that the FIFO is ready to receive the character
> so it will drop data sometimes.
>

Okay. Thank you for pointing this out.
I would like to make a new serial device based on pl011,
but containing a buffer for guest-to-host data transfer.
I expect it would help to reduce the processor time while communicating.
For now, my focus is on getting better performance.

regards,
Jiahuan


* Re: [Qemu-devel] Hight Processor time of Socket communciation
  2017-04-19 10:04           ` Jiahuan Zhang
@ 2017-04-19 10:09             ` Peter Maydell
  2017-04-19 13:34               ` Jiahuan Zhang
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Maydell @ 2017-04-19 10:09 UTC (permalink / raw)
  To: Jiahuan Zhang; +Cc: QEMU Developers

On 19 April 2017 at 11:04, Jiahuan Zhang <jiahuanzhang90@gmail.com> wrote:
> Okay. Thank you for pointing this out.
> I would like to make a new serial device based on pl011,
> but containing a buffer for guest-to-host data transfer.

As I've said, the PL011 is inherently byte at a time.
If you want better than that, I recommend you use virtio-serial,
because it already exists for this purpose.

thanks
-- PMM


* Re: [Qemu-devel] Hight Processor time of Socket communciation
  2017-04-19 10:09             ` Peter Maydell
@ 2017-04-19 13:34               ` Jiahuan Zhang
  2017-04-19 20:03                 ` Peter Maydell
  0 siblings, 1 reply; 10+ messages in thread
From: Jiahuan Zhang @ 2017-04-19 13:34 UTC (permalink / raw)
  To: Peter Maydell; +Cc: QEMU Developers

On 19 April 2017 at 12:09, Peter Maydell <peter.maydell@linaro.org> wrote:

> On 19 April 2017 at 11:04, Jiahuan Zhang <jiahuanzhang90@gmail.com> wrote:
> > Okay. Thank you for pointing this out.
> > I would like to make a new serial device based on pl011,
> > but containing a buffer for guest-to-host data transfer.
>
> As I've said, the PL011 is inherently byte at a time.
> If you want better than that, I recommend you use virtio-serial,
> because it already exists for this purpose.
>
> thanks
> -- PMM
>

Hi Peter,

But from the source code, I found that the main characteristic of
virtio-serial is that it can create multiple serial ports, and each port
has a pair of control virtqueues and a pair of guest input/output
virtqueues.
Its "have_data" callback function enables multi-byte data transfer from
the guest.
And it is not in a traditional device emulation format;
I mean that no exact I/O region is emulated for it.
Without a Linux kernel driver, I don't know how to manipulate it.
For the time being, I have the guest app send data to
the pl011's data register directly via a pointer, as you can see in the
code above.

This is why I am wondering whether adding a transmit buffer to the pl011
for guest writes is a feasible alternative.

Any suggestions are welcome.
Regards,
Huan


* Re: [Qemu-devel] Hight Processor time of Socket communciation
  2017-04-19 13:34               ` Jiahuan Zhang
@ 2017-04-19 20:03                 ` Peter Maydell
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Maydell @ 2017-04-19 20:03 UTC (permalink / raw)
  To: Jiahuan Zhang; +Cc: QEMU Developers

On 19 April 2017 at 14:34, Jiahuan Zhang <jiahuanzhang90@gmail.com> wrote:
> But from the source code, I found the main characteristic of virtio-serial
> is that,
> virtio-serial can create multiple serial ports and each port has a pair of
> control virt-queues and
> a pair of guest input/output virt-queues.

Its main characteristic is that the data transfer is over
the standard virtio channel (ie a ring buffer in guest
memory). That's why it's fast.
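
As a rough sketch of what "a ring buffer in guest memory" means on the driver side: the descriptor and available-ring field layouts below follow the split-virtqueue format in the virtio spec, while QUEUE_SIZE, struct tx_queue and tx_queue_add() are hypothetical simplifications. A real driver also needs the used ring, memory barriers, descriptor recycling and a device notification.

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_SIZE 8u   /* hypothetical queue size (a power of two) */

/* The spec defines these fields as little-endian, so this sketch
 * assumes a little-endian guest. */
struct virtq_desc {
    uint64_t addr;    /* guest-physical address of the buffer */
    uint32_t len;     /* buffer length in bytes */
    uint16_t flags;   /* 0 here: device-readable, no chaining */
    uint16_t next;    /* only meaningful when descriptors are chained */
};

struct virtq_avail {
    uint16_t flags;
    uint16_t idx;                 /* where the driver puts the next entry */
    uint16_t ring[QUEUE_SIZE];    /* heads of descriptor chains */
};

/* Hypothetical container for the driver-owned parts of one queue. */
struct tx_queue {
    struct virtq_desc desc[QUEUE_SIZE];
    struct virtq_avail avail;
    uint16_t next_free;           /* naive allocator: no recycling */
};

/* Publish one whole buffer to the device; returns the descriptor index. */
static uint16_t tx_queue_add(struct tx_queue *q, uint64_t addr, uint32_t len)
{
    uint16_t head;

    assert(q != 0 && q->next_free < QUEUE_SIZE);
    head = q->next_free++;
    q->desc[head].addr = addr;
    q->desc[head].len = len;
    q->desc[head].flags = 0;
    q->desc[head].next = 0;
    q->avail.ring[q->avail.idx % QUEUE_SIZE] = head;
    /* a real driver issues a memory barrier here before updating idx */
    q->avail.idx++;
    return head;
}
```

A single descriptor can describe the entire 5M image, after which the driver notifies the device once instead of once per byte.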

> Its "have_data" callback function enables multi-byte data transfer from
> guest.
> And it is not in a traditional device emulation format.
> I mean that no exact IO region is emulated for it.

It's a PCI device, it has registers the same way PCI
devices do. (Or for virtio-mmio, it has MMIO registers
like other MMIO devices).
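
For the virtio-mmio case, probing those registers without a kernel driver can be sketched as below. The register offsets, the magic value and the console device ID are taken from the virtio spec; the function name and the 'base' parameter (wherever the board maps the transport) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Register offsets from the virtio-mmio section of the virtio spec. */
#define VIRTIO_MMIO_MAGIC_VALUE 0x000u
#define VIRTIO_MMIO_DEVICE_ID   0x008u

#define VIRTIO_MMIO_MAGIC       0x74726976u  /* "virt", little-endian */
#define VIRTIO_ID_CONSOLE       3u           /* console (virtio-serial) */

/* Returns 1 if a virtio console sits behind this transport, else 0.
 * 'base' is hypothetical: the mapped base of one virtio-mmio slot. */
static int is_virtio_console(const volatile uint32_t *base)
{
    assert(base != 0);
    if (base[VIRTIO_MMIO_MAGIC_VALUE / 4] != VIRTIO_MMIO_MAGIC)
        return 0;
    return base[VIRTIO_MMIO_DEVICE_ID / 4] == VIRTIO_ID_CONSOLE;
}
```

Finding the device this way is only the first step; feature negotiation and queue setup follow the initialization sequence in the spec.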

> Without Linux kernel driver, I don't know how to manipulate it.

You can look at the virtio spec if you want to do direct
work with virtio devices. Programming one isn't any
more complicated than any other high-data-transfer
device (like a modern ethernet card, for instance).

> For the time being, I enable the guest app to send data to
> the pl011's data register directly via a pointer, as you see that in the
> code above.
>
> This is why I am thinking if adding a transmit buffer in pl011 for guest
> writing
> is a feasible alternative.

Well, anything like this is never going to be of interest
to QEMU upstream, because PL011s don't work like that.
Virtio is our answer to "what is the most efficient way
to transfer data from a guest to a host in a virtual
machine", so we don't need to reinvent that wheel.
The chances are that ad-hoc modifications to the PL011
will end up being slower than virtio.

thanks
-- PMM
