Lighttpd 1.4.14 released

April 13, 2007

We are pleased to announce the release of lighttpd 1.4.14. This is mainly a bug fix release including 2 security fixes. It is recommended to upgrade or at least apply the patches.

Download

As 1.4.14 had a cookies related bug please use 1.4.15

Thanks for using lighttpd!:)

Powered by Lighttpd

April 04, 2007

lighttpd is used by many well-known sites. The typical scenario is using lighttpd as off-load server to push out static content and leave to complex work to another server.

One example is YouTube. They have a farm of servers which push out the thumbnails you see before you see the movies:

$ curl -I http://sjl-static16.sjl.youtube.com/vi/TgF_eRkfqEY/2.jpg
HTTP/1.1 200 OK
Content-Type: image/jpeg
ETag: "983726135810477085"
Accept-Ranges: bytes
Last-Modified: Fri, 03 Feb 2006 04:32:53 GMT
Content-Length: 3495
Date: Wed, 04 Apr 2007 06:49:51 GMT
Server: lighttpd-aio/1.4.11.8

As you see in the name-scheme, there are some more of those servers pushing out content.

On wikipedia they run at least 2 servers with lighttpd:

  • upload.wikimedia.org
  • download.wikimedia.org

While download is used to distribute the SQL dumps of the database when tend to grow above the magic 4GByte border, upload is used to push out all the images and resize them when necessary.

$ curl -I http://upload.wikimedia.org/wikipedia/commons/thumb/2/21/Mandel_zoom_00_mandelbrot_set.jpg/250px-Mandel_zoom_00_mandelbrot_set.jpg
HTTP/1.0 200 OK
Content-Type: image/jpeg
ETag: "6460328581220324712"
Accept-Ranges: bytes
Last-Modified: Mon, 04 Dec 2006 22:24:53 GMT
Content-Length: 5973
Date: Wed, 14 Mar 2007 11:16:19 GMT
Server: lighttpd/1.4.13
X-Cache: HIT from sq13.wikimedia.org
X-Cache-Lookup: HIT from sq13.wikimedia.org:3128
X-Cache: HIT from knsq12.knams.wikimedia.org
X-Cache-Lookup: HIT from knsq12.knams.wikimedia.org:3128
Age: 6265
X-Cache: HIT from knsq10.knams.wikimedia.org
X-Cache-Lookup: HIT from knsq10.knams.wikimedia.org:80
Via: 1.0 sq13.wikimedia.org:3128 (squid/2.6.STABLE12), 1.0 knsq12.knams.wikimedia.org:3128 (squid/2.6.STABLE12), 1.0 knsq10.knams.wikimedia.org:80 (squid/2.6.STABLE12)
Connection: close

thread starvation

March 05, 2007

While running several benchmarks I saw that http_load was getting time outs for some of the connections. This has been seen with all web-servers and different backends in lighttpd as well.

After patching http_load to handle timed out and byte count errors correctly, I could easily separate the time outs from other problems. In one of the last changesets I added a infrastructure to track the time spent in lighttpd for a single connection including the time spent in the different stages of the gthread-aio backend:

  • scheduling the threaded-read()
  • starting the read() in the thread
  • waiting until it is finish
  • sending the result into the main-loop
  • writing the buffered data to the socket

You can enable this timing by setting the define LOG_TIMING.

network_gthread_aio.c.78: (trace) write-start: 1173101341.237616 read-queue-wait: 680 ms read-time: 0 ms write-time: 21 ms
network_gthread_aio.c.78: (trace) write-start: 1173101341.229014 read-queue-wait: 576 ms read-time: 0 ms write-time: 134 ms
network_gthread_aio.c.78: (trace) write-start: 1173101341.240815 read-queue-wait: 681 ms read-time: 6 ms write-time: 19 ms

I wrote a script to extract this timing data from errorlog and used gnuplot to turn it to a images:

#!/bin/sh

## parse the errorlog for the lines from the timing_print
## - extract the numbers
## - sort it by start-time
## - make all timestamps relative to the first start-time

cat benchmark.error.log | \
    grep "network_gthread_aio.c.78:" | \
    awk '{ print $4,$6,$9,$12 }'  | \
    sort -n | \
    perl -ne '@e=split / /;if (!$start) { $start = $e[0]; } $e[0] = $e[0] - $start; print join " ", @e; ' > $1.data

cat <<EOF
set autoscale
set terminal png
set xlabel "start-time of a request"
set ylabel "ms per request"
set yrange [0:30000]

set title "$1"
set output "$1.png"
plot \
"$1.data" using 1:2 title "read-queue-wait-time" with points ps 0.2, \
"$1.data" using 1:(\$2 + \$3) title "read-time" with points ps 0.2, \
"$1.data" using 1:(\$2 + \$3 + \$4) title "write-time" with dots

set title "$1 (read-queue-wait)"
set output "$1-read-queue-wait.png"
plot \
"$1.data" using 1:2 title "read-queue-wait-time" with points ps 0.8

set title "$1 (read)"
set output "$1-read.png"
plot \
"$1.data" using 1:3 title "read-time" with points ps 0.8 pt 2

set title "$1 (write)"
set output "$1-write.png"
plot \
"$1.data" using 1:4 title "write-wait-time" with points ps 0.8 pt 3

EOF

The first benchmark was taking:

./http_load -parallel 100 -fetches 500 ./http-load.10M.urls-10G

and

server.max-read-threads = 64
## compiled with 64k read-ahead

The time spent with read()ing the data from the disk goes up:

64 threads

more detailed:

64 threads, read() calls

If you reduce the threads to 4 you get:

4 threads

and the read-time drops to:

4 threads, read() calls

Interesting for our time-outs are only the dots which leave the 4 sec range as that are our starving read() threads. If it takes too long for them to finish the client will close the connection and the user will get a broken transfer.

Reducing the number of threads helps to limit the impact of the problem as we can see above in the graphs:

threads-runnable = threads-started - threads-blocked

The probability of a stuck thread to get CPU-time again is increasing the more threads are getting stuck as less threads can run. In the worst case all available threads are waiting and at least one of them will get finished.

more graphs

Rule of thumb

Keep max-threads at twice the number of disk.