thread starvation
March 5th, 2007
While running several benchmarks I saw that http_load was getting timeouts for some of the connections. This happened with all web-servers and with the different backends in lighttpd as well.
After patching http_load to handle timed out and byte count errors correctly, I could easily separate the timeouts from other problems. In one of the last changesets I added infrastructure to track the time spent in lighttpd for a single connection, including the time spent in the different stages of the gthread-aio backend:
- scheduling the threaded-read()
- starting the read() in the thread
- waiting until it is finished
- sending the result into the main-loop
- writing the buffered data to the socket
You can enable this timing by setting the define LOG_TIMING.
network_gthread_aio.c.78: (trace) write-start: 1173101341.237616 read-queue-wait: 680 ms read-time: 0 ms write-time: 21 ms
network_gthread_aio.c.78: (trace) write-start: 1173101341.229014 read-queue-wait: 576 ms read-time: 0 ms write-time: 134 ms
network_gthread_aio.c.78: (trace) write-start: 1173101341.240815 read-queue-wait: 681 ms read-time: 6 ms write-time: 19 ms
I wrote a script to extract this timing data from the errorlog and used gnuplot to turn it into images:
#!/bin/sh
## parse the errorlog for the lines from the timing_print
## - extract the numbers
## - sort it by start-time
## - make all timestamps relative to the first start-time
cat benchmark.error.log | \
grep "network_gthread_aio.c.78:" | \
awk '{ print $4,$6,$9,$12 }' | \
sort -n | \
perl -ne '@e=split / /;if (!$start) { $start = $e[0]; } $e[0] = $e[0] - $start; print join " ", @e; ' > $1.data
cat <<EOF
set autoscale
set terminal png
set xlabel "start-time of a request"
set ylabel "ms per request"
set yrange [0:30000]
set title "$1"
set output "$1.png"
plot \
"$1.data" using 1:2 title "read-queue-wait-time" with points ps 0.2, \
"$1.data" using 1:(\$2 + \$3) title "read-time" with points ps 0.2, \
"$1.data" using 1:(\$2 + \$3 + \$4) title "write-time" with dots
set title "$1 (read-queue-wait)"
set output "$1-read-queue-wait.png"
plot \
"$1.data" using 1:2 title "read-queue-wait-time" with points ps 0.8
set title "$1 (read)"
set output "$1-read.png"
plot \
"$1.data" using 1:3 title "read-time" with points ps 0.8 pt 2
set title "$1 (write)"
set output "$1-write.png"
plot \
"$1.data" using 1:4 title "write-wait-time" with points ps 0.8 pt 3
EOF
The first benchmark was run with:
./http_load -parallel 100 -fetches 500 ./http-load.10M.urls-10G
and
server.max-read-threads = 64
## compiled with 64k read-ahead
The time spent with read()ing the data from the disk goes up:

more detailed:

If you reduce the threads to 4 you get:

and the read-time drops to:

For our timeouts only the dots which leave the 4 sec range are interesting, as those are our starving read() threads. If it takes too long for them to finish, the client will close the connection and the user will get a broken transfer.
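To put a number on the starving requests, the data file written by the script above can be filtered for rows that exceed a limit. This is only a minimal sketch and not part of lighttpd or of the original script: the function name count_slow is made up for this example, and it assumes the four-column layout (start-time, read-queue-wait, read-time, write-time) produced by the awk step.

```sh
#!/bin/sh
# count_slow FILE COLUMN LIMIT_MS   (hypothetical helper)
# Counts the rows of a "$1.data"-style file whose value in COLUMN is larger
# than LIMIT_MS. Assumed columns:
#   1 = start-time, 2 = read-queue-wait, 3 = read-time, 4 = write-time
count_slow() {
    awk -v col="$2" -v limit="$3" '
        $col + 0 > limit + 0 { n++ }
        END { print n + 0 }
    ' "$1"
}
```

Calling count_slow with the data file, column 3 and a limit of 4000 would print how many requests spent more than 4 seconds in read().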
Reducing the number of threads helps to limit the impact of the problem as we can see above in the graphs:
threads-runnable = threads-started - threads-blocked
The more threads get stuck, the fewer threads can run, so the probability that a stuck thread gets CPU-time again increases. In the worst case all available threads are waiting and at least one of them will get finished.
more graphs
- 4 read-threads, 10MByte files, 64kbytes read-ahead
- 4 read-threads, 10MByte files, 1MBytes read-ahead
- 64 read-threads, 10MByte files, 64kbytes read-ahead
- 64 read-threads, 10MByte files, 1MBytes read-ahead
Rule of thumb
Keep max-threads at twice the number of disks.
buffered IO performance
February 11th, 2007
Next to the raw-IO performance, which is important for heavy, static file transfers, the buffered IO performance is more interesting for sites which have a small set of static files that can be kept in the fs-caches.
As we are using hot caches for this benchmark, the "lightness" of the server becomes important. The fewer syscalls it has to do, the better.
The test-case is made up of 100MByte of files in sizes of 10MByte and 100kByte.
Benchmark
100kByte
100MByte of 100kBytes files served from the hot caches:
| lighttpd | |||
|---|---|---|---|
| backend | MByte/s | req/s | user + sys |
| writev | 82.20 | 802.71 | 90% |
| linux-sendfile | 70.27 | 686.32 | 56% |
| gthread-aio | 75.39 | 736.23 | 98% |
| posix-aio | 73.10 | 713.88 | 98% |
| linux-aio-sendfile | 31.32 | 305.90 | 35% |
| others | |||
| Apache 2.2.4 (event) | 70.28 | 686.38 | 60% |
| LiteSpeed 3.0rc2 | 70.20 | 685.65 | 50% |
- linux-aio-sendfile is losing most of its performance as it has to use O_DIRECT to operate, which is always an unbuffered read.
- Apache, LiteSpeed and linux-sendfile are using the same syscall, sendfile(), and end up with the same performance values
- gthread-aio and posix-aio perform better than sendfile()
- write() performs better than the threaded AIO and sendfile(). I can't explain that right now :)
10MByte
100MByte of 10MBytes files served from the hot caches. The benchmark command has been changed as in the other benchmarks:
$ http_load -verbose -timeout 40 -parallel 100 -fetches 500 http-load.10M.urls-100M
http_load is doing a hard cut when we are using the -seconds option and we might lose some MByte/s due to incomplete transfers.
| lighttpd | |||
|---|---|---|---|
| backend | MByte/s | req/s | user + sys |
| writev | 82.20 | 8.76 | 80% |
| linux-sendfile | 53.95 | 5.65 | 40% |
| gthread-aio | 83.02 | 8.66 | 90% |
| posix-aio | 82.31 | 8.60 | 93% |
| linux-aio-sendfile | 70.17 | 7.35 | 60% |
| others | |||
| Apache 2.2.4 (event) | 50.92 | 5.33 | 40% |
| LiteSpeed 3.0rc2 | 55.58 | 5.80 | 40% |
- all the sendfile() implementations seem to have the same performance problem.
- writev() and the threaded AIO backends utilize the network as expected
- linux-aio-sendfile is faster than the buffered sendfile() even if it has to read everything from disk ... strange
raw IO performance
February 3rd, 2007
In lighttpd 1.5.0 we support several network backends. Their job is to fetch static data from disk and send it to the client.
We want to compare the performance of the different backends and show when to use which.
- writev
- linux-sendfile
- gthread-aio
- posix-aio
- linux-aio-sendfile
The impact of the stat-threads shall also be checked.
We use a minimal configuration file:
server.document-root = "/home/jan/wwwroot/servers/grisu.home.kneschke.de/pages/"
server.port = 1025
server.errorlog = "/home/jan/wwwroot/logs/lighttpd.error.log"
server.network-backend = "linux-aio-sendfile"
server.event-handler = "linux-sysepoll"
server.use-noatime = "enable"
server.max-stat-threads = 2
server.max-read-threads = 64
iostat, vmstat and http_load
We used iostat and vmstat to see how the system is handling the load.
$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 506300 12900 20620 437968 0 0 14204 17 5492 3323 4 23 9 63
0 1 506300 11212 20620 439888 0 0 17720 4 6713 3966 3 29 3 66
0 1 506300 11664 20632 440356 0 0 14460 8 5416 3120 2 24 2 71
1 0 506300 18916 20612 433168 0 0 13180 50 5505 3088 2 23 2 72
0 1 506300 11960 20628 440188 0 0 15860 6 5485 3307 2 24 3 71
$ iostat -xm 5
avg-cpu: %user %nice %system %iowait %steal %idle
2.20 0.00 24.40 70.40 0.00 3.00
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 67.40 0.40 68.00 0.80 13600.00 14.40 6.64 0.01 197.88 12.87 176.83 11.23 77.28
sdb 82.80 0.40 84.60 0.80 17000.00 14.40 8.30 0.01 199.23 23.18 280.16 11.61 99.12
md0 0.00 0.00 302.20 0.80 30520.00 6.40 14.90 0.00 100.75 0.00 0.00 0.00 0.00
Our http_load process returned:
>http_load -verbose -timeout 40 -parallel 100 -seconds 60 urls.100k
9117 fetches, 100 max parallel, 9.33581e+008 bytes, in 60 seconds
102400 mean bytes/connection
151.95 fetches/sec, 1.55597e+007 bytes/sec
msecs/connect: 5.47226 mean, 31.25 max, 0 min
msecs/first-response: 144.433 mean, 3156.25 max, 0 min
HTTP response codes:
code 200 -- 9117
We will use the same benchmark and the same configuration to compare the different back-ends.
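The two numbers that end up in the comparison tables can be pulled out of the http_load report with a short awk filter. This is only a sketch: the function name http_load_summary is invented here, and it assumes that 1 MByte is counted as 1,000,000 bytes, which is what the figures suggest (1.55597e+007 bytes/sec matches the 15.56MByte/s listed for linux-aio-sendfile in the tables below).

```sh
#!/bin/sh
# http_load_summary (hypothetical helper): read an http_load report on stdin
# and print requests/s and throughput in MByte/s.
# Assumption: 1 MByte = 1000000 bytes.
http_load_summary() {
    awk '/fetches\/sec/ {
        # the line looks like: "151.95 fetches/sec, 1.55597e+007 bytes/sec"
        printf "%.2f req/s, %.2f MByte/s\n", $1, $3 / 1000000
    }'
}

# Example with the line from the report above:
echo "151.95 fetches/sec, 1.55597e+007 bytes/sec" | http_load_summary
# prints: 151.95 req/s, 15.56 MByte/s
```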
Comparison
For comparison I tried to add other web-servers to the ring. As always, benchmarks have to be taken with a grain of salt. Don't trust them, try to repeat them yourself.
- lighttpd 1.5.0-svn, config is above
- litespeed 3.0rc2
- epoll and sendfile are enabled. All the other options are defaults.
- Apache 2.2.4 event-mpm from the OpenSUSE 10.2 packages
- MinSpareThreads = 25
- MaxSpareThreads = 75
- ThreadLimit = 64
- ThreadsPerChild = 25
I'll try to get a shrunk-down, text-based config-file which only contains the necessary options, so that others can repeat the test.
Expectations
The benchmark is supposed to show that async file-IO for single-threaded webservers is good. We expect that:
- the blocking network-backends are slow
- that Apache 2.2.4 offers the best performance as it is threaded + event-based
- lighttpd + async file-io gets into the range of Apache2
The problem with the blocking file-IO is that a single-threaded server can do nothing else while it is waiting for a syscall to finish.
Benchmarks
Running the http_load against different backends shows the impact of async-io vs. sync-io.
100k files
| lighttpd | ||
|---|---|---|
| backend | throughput | requests/s |
| writev | 6.11MByte/s | 59.77 |
| linux-sendfile | 6.50MByte/s | 63.62 |
| posix-aio | 12.88MByte/s | 125.75 |
| gthread-aio | 15.04MByte/s | 147.08 |
| linux-aio-sendfile | 15.56MByte/s | 151.95 |
| others | ||
| litespeed 3.0rc2 (writev) | 4.35MByte/s | 42.78 |
| litespeed 3.0rc2 (sendfile) | 5.49MByte/s | 53.68 |
| apache 2.2.4 | 15.04MByte/s | 146.93 |
For small files you can gain around 140% more throughput.
without no-atime
To show the impact of server.use-noatime = "enable" we compare the vmstat output for the gthread-aio backend with and without noatime:
With O_NOATIME enabled:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 62 506300 9192 20500 470064 0 0 12426 5 7005 6324 3 27 7 63
0 63 506300 10188 19768 469732 0 0 14154 2 8252 7614 3 30 0 67
0 64 506300 10488 19124 470492 0 0 13589 0 8261 7483 3 27 0 69
0 64 506300 10196 17952 473092 0 0 13062 8 7388 6560 3 25 8 65
0 64 506300 10656 16836 474720 0 0 11790 0 6378 5074 2 23 11 64
With O_NOATIME disabled:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 21 506300 10408 15452 491680 0 0 10515 326 5362 1619 2 22 19 57
3 7 506300 11116 17888 487588 0 0 11020 493 6056 2400 7 25 10 58
0 0 506300 10200 19704 488004 0 0 8840 365 4506 1622 2 20 29 49
2 14 506300 10460 21624 485288 0 0 12422 428 6464 1986 2 26 11 60
0 1 506300 9436 24116 485316 0 0 12640 513 7159 2109 3 28 2 67
0 21 506300 11864 25768 481588 0 0 8760 5 4436 1571 2 19 39 41
0 21 506300 10352 24892 483412 0 0 11941 339 6005 1913 3 24 12 6
You see how bo (blocks out) goes up and bi (blocks in) goes down at the same time when noatime is disabled. As you usually don't need the atime (file access time), you should either mount the file-system with noatime, nodiratime or use the setting server.use-noatime = "enable". By default this setting is disabled to be backward compatible.
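The difference between the two runs is easier to see as averages. The following sketch computes the mean of the bi and bo columns from vmstat output; the function name vmstat_avg_io is invented for this example, and it assumes the 16-column layout shown above, where bi is the 9th and bo the 10th field.

```sh
#!/bin/sh
# vmstat_avg_io (hypothetical helper): read "vmstat N" output on stdin and
# print the mean of the bi (9th) and bo (10th) columns.
# Header lines are skipped because their first field is not a number.
vmstat_avg_io() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        bi += $9; bo += $10; n++
    }
    END {
        if (n > 0) printf "bi=%.0f bo=%.0f\n", bi / n, bo / n
    }'
}
```

Fed with the rows above, it prints bi=13004 bo=3 for the run with O_NOATIME enabled and bi=10877 bo=353 for the run with it disabled.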
10MByte files
| lighttpd | ||||
|---|---|---|---|---|
| backend | throughput | requests/s | %disk-util | user + sys |
| writev | 17.59MByte/s | 2.35 | 50 % | 25 % |
| linux-sendfile | 33.13MByte/s | 3.77 | 70 % | 30 % |
| posix-aio | 50.61MByte/s | 5.69 | 98% | 60% |
| gthread-aio | 47.97MByte/s | 5.51 | 100% | 50% |
| linux-aio-sendfile | 44.15MByte/s | 4.95 | 90% | 40 % |
| others | ||||
| litespeed 3.0rc2 (sendfile) | 22.18MByte/s | 2.72 | 65% | 35 % |
| Apache 2.2.4 | 42.81MByte/s | 4.73 | 95% | 40 % |
For larger files the win that you have with async-io is still around 50%.
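The "around 140%" and "around 50%" figures can be checked against the tables with a one-line calculation. Which rows are meant is an assumption here: the sketch compares the best blocking backend (linux-sendfile) with the best async backend in each table, and gain_pct is a name invented for this example.

```sh
#!/bin/sh
# gain_pct BASE CUR (hypothetical helper):
# print how much larger CUR is than BASE, in percent.
gain_pct() {
    awk -v base="$1" -v cur="$2" \
        'BEGIN { printf "%.0f\n", (cur - base) / base * 100 }'
}

gain_pct 6.50 15.56    # 100k files: linux-sendfile vs. linux-aio-sendfile, prints 139
gain_pct 33.13 50.61   # 10MByte files: linux-sendfile vs. posix-aio, prints 53
```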
stat() threads
For small files the performance is largely influenced by the stat() sys-call. Before a file is open()ed for reading, we first check whether the file exists, whether it is a regular file and whether we can read from it. This sys-call is not async by itself.
We will use the gthread-aio backend and the set of 100k files and run the benchmark again, this time changing the server.max-stat-threads from 0 to 16.
| threads | throughput |
|---|---|
| 0 | 8.55MByte/s |
| 1 | 13.60MByte/s |
| 2 | 14.18MByte/s |
| 4 | 12.33MByte/s |
| 8 | 12.62MByte/s |
| 12 | 13.10MByte/s |
| 16 | 12.71MByte/s |
You should set the number of stat-threads equal to your number of disks for optimal performance.
read-threads
You can also tune the number of read-threads. Each disk-read request is queued and then executed in parallel by a pool of readers. The goal is to keep the disk utilized at 100% and to hide the seeks for a stat() and a read() in the time that lighttpd spends sending the data to the network.
| threads | throughput |
|---|---|
| 1 | 6.83MByte/s |
| 2 | 11.61MByte/s |
| 4 | 13.02MByte/s |
| 8 | 13.61MByte/s |
| 16 | 13.81MByte/s |
| 32 | 14.04MByte/s |
| 64 | 14.87MByte/s |
It looks like 2 read-threads per disk are already a good value.
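The two observations from the stat-threads and read-threads tables can be turned into starting values for a given number of disks. This is only a sketch of the rule of thumb (one stat-thread and two read-threads per disk), not a setting that lighttpd derives by itself; suggest_threads is a name made up for this example and the values should be verified with your own benchmark.

```sh
#!/bin/sh
# suggest_threads DISKS (hypothetical helper):
# print starting values for the two thread settings.
# Assumption: one stat-thread and two read-threads per disk.
suggest_threads() {
    disks=$1
    echo "server.max-stat-threads = $disks"
    echo "server.max-read-threads = $(( disks * 2 ))"
}

# The test server has 2 disks:
suggest_threads 2
```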
Benchmarks
February 2nd, 2007
In the article lighty 1.5.0 and linux-aio we proposed a benchmark suite for measuring the performance of a disk-io-bound application.
http_load
As load-generator we use http_load as it
- allows random fetches from a list of URLs
- allows a large number of parallel requests
- is portable
On the command-line we want to execute it like:
$ ./http_load -verbose -parallel 100 -fetches 10000 urls
file pool
On the machine which is supposed to be tested we generate 2 sets of 10GByte of files. One consists of 100,000 files of 100kByte size, the other of 1,000 files of 10MByte size.
$ cd $docroot
$ mkdir -p seek-bound/100k/
$ cd seek-bound/100k/
$ for i in `seq 1 1000`; do
mkdir -p files-$i;
for j in `seq 1 100`; do
dd if=/dev/zero of=files-$i/$j bs=100k count=1 2> /dev/null;
done;
done
The file-pool is 10 times larger than the available RAM on the server-host. Based on this disk layout we generate the list of URLs for http_load.
$ find ./seek-bound/100k/ | grep 'files.*/.' | sed 's#./#http://192.168.2.106/#' > http-load.100k.urls
The same commands are executed for the 10MByte files to generate a file-set which checks the performance for large files.
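Since both file-sets are built the same way, the loop can be wrapped into one parameterized function. This is a sketch under assumptions: make_pool is a name invented here, and the directory layout for the 10MByte set (10 directories with 100 files each) is a guess, as only the layout of the 100k set is shown above.

```sh
#!/bin/sh
# make_pool TARGET_DIR NUM_DIRS FILES_PER_DIR BLOCK_SIZE   (hypothetical helper)
# Creates TARGET_DIR/files-<i>/<j>, each file one dd block of zeros.
make_pool() {
    target=$1; dirs=$2; per_dir=$3; bs=$4
    i=1
    while [ "$i" -le "$dirs" ]; do
        mkdir -p "$target/files-$i"
        j=1
        while [ "$j" -le "$per_dir" ]; do
            dd if=/dev/zero of="$target/files-$i/$j" bs="$bs" count=1 2> /dev/null
            j=$((j + 1))
        done
        i=$((i + 1))
    done
}

# Not executed here, as each call writes about 10GByte:
#   make_pool seek-bound/100k 1000 100 100k
#   make_pool seek-bound/10M  10   100 10M     # assumed layout
```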
hardware
The test-network is made up of:
- Netgear GS108, an 8-port Gigabit Switch
- client:
- OS: WinXP Prof. 64-bit
- CPU: AMD64 X2 (dual core) 4200+
- Network: Intel Pro/1000
- server:
- OS: Linux 2.6.16.21-0.25-default x86_64 (OpenSuse 10.1)
- CPU: AMD64 3000+
- Network: Intel Pro/1000
- Modules: stock, but ip_conntrack is rmmod'ed
- Disks: 2 SATA disks as RAID1 via the md-driver
The disks are:
Model Number: ST3160827AS
Serial Number: 5MT02VGJ and 3MT08WDV
Firmware Revision: 3.42
$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda2[0] sdb2[1]
155235968 blocks [2/2] [UU]