benchmark

thread starvation

March 05, 2007

While running several benchmarks I saw that http_load was getting time outs for some of the connections. This has been seen with all web-servers and different backends in lighttpd as well.

After patching http_load to handle timed out and byte count errors correctly, I could easily separate the time outs from other problems. In one of the last changesets I added a infrastructure to track the time spent in lighttpd for a single connection including the time spent in the different stages of the gthread-aio backend:

scheduling the threaded-read()
starting the read() in the thread
waiting until it is finish
sending the result into the main-loop
writing the buffered data to the socket

You can enable this timing by setting the define LOG_TIMING.

network_gthread_aio.c.78: (trace) write-start: 1173101341.237616 read-queue-wait: 680 ms read-time: 0 ms write-time: 21 ms
network_gthread_aio.c.78: (trace) write-start: 1173101341.229014 read-queue-wait: 576 ms read-time: 0 ms write-time: 134 ms
network_gthread_aio.c.78: (trace) write-start: 1173101341.240815 read-queue-wait: 681 ms read-time: 6 ms write-time: 19 ms

I wrote a script to extract this timing data from errorlog and used gnuplot to turn it to a images:

#!/bin/sh

## parse the errorlog for the lines from the timing_print
## - extract the numbers
## - sort it by start-time
## - make all timestamps relative to the first start-time

cat benchmark.error.log | \
    grep "network_gthread_aio.c.78:" | \
    awk '{ print $4,$6,$9,$12 }'  | \
    sort -n | \
    perl -ne '@e=split / /;if (!$start) { $start = $e[0]; } $e[0] = $e[0] - $start; print join " ", @e; ' > $1.data

cat <<EOF
set autoscale
set terminal png
set xlabel "start-time of a request"
set ylabel "ms per request"
set yrange [0:30000]

set title "$1"
set output "$1.png"
plot \
"$1.data" using 1:2 title "read-queue-wait-time" with points ps 0.2, \
"$1.data" using 1:(\$2 + \$3) title "read-time" with points ps 0.2, \
"$1.data" using 1:(\$2 + \$3 + \$4) title "write-time" with dots

set title "$1 (read-queue-wait)"
set output "$1-read-queue-wait.png"
plot \
"$1.data" using 1:2 title "read-queue-wait-time" with points ps 0.8

set title "$1 (read)"
set output "$1-read.png"
plot \
"$1.data" using 1:3 title "read-time" with points ps 0.8 pt 2

set title "$1 (write)"
set output "$1-write.png"
plot \
"$1.data" using 1:4 title "write-wait-time" with points ps 0.8 pt 3

EOF

The first benchmark was taking:

./http_load -parallel 100 -fetches 500 ./http-load.10M.urls-10G

and

server.max-read-threads = 64
## compiled with 64k read-ahead

The time spent with read()ing the data from the disk goes up:

64 threads

more detailed:

64 threads, read() calls

If you reduce the threads to 4 you get:

4 threads

and the read-time drops to:

4 threads, read() calls

Interesting for our time-outs are only the dots which leave the 4 sec range as that are our starving read() threads. If it takes too long for them to finish the client will close the connection and the user will get a broken transfer.

Reducing the number of threads helps to limit the impact of the problem as we can see above in the graphs:

threads-runnable = threads-started - threads-blocked

The probability of a stuck thread to get CPU-time again is increasing the more threads are getting stuck as less threads can run. In the worst case all available threads are waiting and at least one of them will get finished.

more graphs

4 read-threads, 10MByte files, 64kbytes read-ahead
all
read-queue-wait
4 read-threads, 10MByte files, 1MBytes read-ahead
all
read-queue-wait
read
write
64 read-threads, 10MByte files, 64kbytes read-ahead
all
read-queue-wait
read
write
64 read-threads, 10MByte files, 1MBytes read-ahead
all
read
write

Rule of thumb

Keep max-threads at twice the number of disk.

buffered IO performance

February 11, 2007

Next to the raw-IO performance which is important for heavy, static file transfers the buffered IO performance is more interesting for sites which have a small set of static files which can be kept in the fs-caches.

As we are using hot-caches for this benchmark the “lightness” of the server becomes important. The less syscalls it has to do, the better.

The test-case is made up of 100MByte of files in the size of 10MByte and 100kByte.

Benchmark

100kByte

100MByte of 100kBytes files served from the hot caches:

lighttpd
backend	MByte/s	req/s	user + sys
writev	82.20	802.71	90%
linux-sendfile	70.27	686.32	56%
gthread-aio	75.39	736.23	98%
posix-aio	73.10	713.88	98%
linux-aio-sendfile	31.32	305.90	35%
others
Apache 2.2.4 (event)	70.28	686.38	60%
LiteSpeed 3.0rc2	70.20	685.65	50%

linux-aio-sendfile is loosing most of its performance as it has to use O_DIRECT to operation which always is a unbuffered read.
Apache, LiteSpeed and linux-sendfile are using the same syscall: sendfile() and end up with the same performance values
gthread-aio and posix-aio perform better than sendfile()
write() performs better than the threaded AIO and sendfile() I can’t explain that right now :)

10MByte

100MByte of 10MBytes files served from the hot caches. The benchmark command has been changed as in the other benchmarks:

$ http_load -verbose -timeout 40 -parallel 100 -fetches 500 http-load.10M.urls-100M

http_load is doing a hard cut when we are using the -seconds option and we might lose some MByte/s due to incomplete transfers.

lighttpd
backend	MByte/s	req/s	user + sys
writev	82.20	8.76	80%
linux-sendfile	53.95	5.65	40%
gthread-aio	83.02	8.66	90%
posix-aio	82.31	8.60	93%
linux-aio-sendfile	70.17	7.35	60%
others
Apache 2.2.4 (event)	50.92	5.33	40%
LiteSpeed 3.0rc2	55.58	5.80	40%

all the sendfile() implementations seem to have the same performance problem.
writve() and the threaded AIO backends utilize the network as expected
linux-aio-sendfile is faster as the buffered sendfile() even if it has to read everything from disk … strange

raw IO performance

February 03, 2007

In lighttpd 1.5.0 we support several network backends. Their job is to fetch static data from disk and send it to the client.

We want to compare the different backends for their performance and when you want to use which.

writev
linux-sendfile
gthread-aio
posix-aio
linux-aio-sendfile

The impact of the stat-threads shall also be checked.

We use a minimal configuration files:

server.document-root     = "/home/jan/wwwroot/servers/grisu.home.kneschke.de/pages/"

server.port              = 1025
server.errorlog          = "/home/jan/wwwroot/logs/lighttpd.error.log"

server.network-backend   = "linux-aio-sendfile"
server.event-handler     = "linux-sysepoll"
server.use-noatime       = "enable"

server.max-stat-threads  = 2
server.max-read-threads = 64

iostat, vmstat and http_load

We used iostat and vmstat to see how the system is handling the load.

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0 506300  12900  20620 437968    0    0 14204    17 5492  3323  4 23  9 63
 0  1 506300  11212  20620 439888    0    0 17720     4 6713  3966  3 29  3 66
 0  1 506300  11664  20632 440356    0    0 14460     8 5416  3120  2 24  2 71
 1  0 506300  18916  20612 433168    0    0 13180    50 5505  3088  2 23  2 72
 0  1 506300  11960  20628 440188    0    0 15860     6 5485  3307  2 24  3 71

$ iostat -xm 5
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
       2.20    0.00   24.40   70.40    0.00    3.00

Device:    rrqm/s wrqm/s    r/s   w/s  rsec/s  wsec/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
sda         67.40   0.40  68.00  0.80 13600.00   14.40     6.64     0.01   197.88    12.87  176.83  11.23  77.28
sdb         82.80   0.40  84.60  0.80 17000.00   14.40     8.30     0.01   199.23    23.18  280.16  11.61  99.12
md0          0.00   0.00 302.20  0.80 30520.00    6.40    14.90     0.00   100.75     0.00    0.00   0.00   0.00

Our http_load process returned:

>http_load -verbose -timeout 40 -parallel 100 -seconds 60 urls.100k
9117 fetches, 100 max parallel, 9.33581e+008 bytes, in 60 seconds
102400 mean bytes/connection
151.95 fetches/sec, 1.55597e+007 bytes/sec
msecs/connect: 5.47226 mean, 31.25 max, 0 min
msecs/first-response: 144.433 mean, 3156.25 max, 0 min
HTTP response codes:
  code 200 -- 9117

We will use the same benchmark and the same configuration to compare the different back-ends.

Comparision

As comparision I tried the add other web-servers to the ring. As always, benchmark have to taken with a grain of salt. Don’t trust them, try to repeat them yourself.

lighttpd 1.5.0-svn, config is above
litespeed 3.0rc2
- epoll and sendfile are enabled. All the other options are defaults.
Apache 2.2.4 event-mpm from the OpenSUSE 10.2 packages
- MinSpareThreads = 25
- MaxSpareThreads = 75
- ThreadLimit = 64
- ThreadsPerChild = 25

I’ll try to get a shrinked down, text based config-file which only contains the necessary options for others to repeat.

Expectations

The benchmark is supposed to show that async file-IO for single-threaded webservers is good. We expect that:

the blocking network-backends are slow
that Apache 2.2.4 offers the best performance as it is threaded + event-based
lighttpd + async file-io gets into the range of Apache2

The problem with the blocking file-IO is that a single-threaded server can do nothing else while it is waiting for a syscall to finish.

Benchmarks

Running the http_load against different backends shows the impact of async-io vs. sync-io.

100k files

lighttpd
backend	throughput	requests/s
writev	6.11MByte/s	59.77
linux-sendfile	6.50MByte/s	63.62
posix-aio	12.88MByte/s	125.75
gthread-aio	15.04MByte/s	147.08
linux-aio-sendfile	15.56MByte/s	151.95
others
litespeed 3.0rc2 (writev)	4.35MByte/s	42.78
litespeed 3.0rc2 (sendfile)	5.49MByte/s	53.68
apache 2.2.4	15.04MByte/s	146.93

For small files you can gain around 140% more throughput.

without no-atime

To show the impact of the server.use-noatime = "enable" we compare the vmstat output for the gthread-aio output with and without noatime:

With O_NOATIME enabled:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1 62 506300   9192  20500 470064    0    0 12426     5 7005  6324  3 27  7 63
 0 63 506300  10188  19768 469732    0    0 14154     2 8252  7614  3 30  0 67
 0 64 506300  10488  19124 470492    0    0 13589     0 8261  7483  3 27  0 69
 0 64 506300  10196  17952 473092    0    0 13062     8 7388  6560  3 25  8 65
 0 64 506300  10656  16836 474720    0    0 11790     0 6378  5074  2 23 11 64

With O_NOATIME disabled:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0 21 506300  10408  15452 491680    0    0 10515   326 5362  1619  2 22 19 57
 3  7 506300  11116  17888 487588    0    0 11020   493 6056  2400  7 25 10 58
 0  0 506300  10200  19704 488004    0    0  8840   365 4506  1622  2 20 29 49
 2 14 506300  10460  21624 485288    0    0 12422   428 6464  1986  2 26 11 60
 0  1 506300   9436  24116 485316    0    0 12640   513 7159  2109  3 28  2 67
 0 21 506300  11864  25768 481588    0    0  8760     5 4436  1571  2 19 39 41
 0 21 506300  10352  24892 483412    0    0 11941   339 6005  1913  3 24 12 6

You see how the bo (blocks out) goes up and how bi (blocks in) goes down in the same way. As you usually don’t need the atime (file access file), you should either mount the file-system with noatime, nodiratime or use the setting server.use-noatime = "enable". By default this setting is disabled to be backward compatible.

10MByte files

lighttpd
backend	throughput	requests/s	%disk-util	user + sys
writev	17.59MByte/s	2.35	50 %	25 %
linux-sendfile	33.13MByte/s	3.77	70 %	30 %
posix-aio	50.61MByte/s	5.69	98%	60%
gthread-aio	47.97MByte/s	5.51	100%	50%
linux-aio-sendfile	44.15MByte/s	4.95	90%	40 %
others
litespeed 3.0rc2 (sendfile)	22.18MByte/s	2.72	65%	35 %
Apache 2.2.4	42.81MByte/s	4.73	95%	40 %

For larger files the win that you have with async-io is still around 50%.

stat() threads

For small files the performance is largely influenced by the stat() sys-call. Before a file is open()ed for reading it is first checked, if the file exists, if it is a regular file and if we can read from it. This sys-call is not async by itself.

We will use the gthread-aio backend and the set of 100k files and run the benchmark again, this time changing the server.max-stat-threads from 0 to 16.

threads	throughput
0	8.55MByte/s
1	13.60MByte/s
2	14.18MByte/s
4	12.33MByte/s
8	12.62MByte/s
12	13.10MByte/s
16	12.71MByte/s

You should set the number of stat-threads equal to your number of disks for optimal performance.

read-threads

You can also tune the number of read-threads. Each disk-read request is queue and then executed in parallel by a pool of readers. The goal is to keep the disk utilized at 100% and hiding the seeks for a stat() and a read() in the time that lighttpd spends in sending the data to the network.

threads	throughput
1	6.83MByte/s
2	11.61MByte/s
4	13.02MByte/s
8	13.61MByte/s
16	13.81MByte/s
32	14.04MByte/s
64	14.87MByte/s

It looks like 2 reads per disk are already a good value.