Technologies left behind

A brief history of the dynamic web from CGI to applications with built-in HTTP servers

24 November 2023

The year is 1993. You need to develop a dynamic website, say a guestbook, which was quite popular at the time. How would you go about it? If your answer is to google it, I have sad news for you: Google won't be a thing for another five years. AltaVista is still two years away. Stack Overflow? Another 15 years... It sure wasn't easy to be a developer in the old days.

Sometimes it's good to look back at those old days to see whether we can avoid the mistakes made back then, or at least avoid reinventing the wheel. That was one of my motivations for going on an adventure to learn how web development evolved into what we know today.

Common Gateway Interface

CGI was developed in the early 1990s and was formally specified only much later, in 2004, as RFC 3875. As the name suggests, it is an interface between a web server and an application. What this means in practice is that if you have an arbitrary executable file, the web server can run it (after proper configuration) and return its output as the response.

The request data arrives in environment variables and on standard input, and the response must be written to standard output, with small syntactic restrictions (the response must start with a header block, typically a Content-Type header, followed by an empty line).

The advantage is that it's simple: just copy a file to a directory, make it executable, and you're done. The disadvantage is that each request means starting a new process, which can be slow and doesn't scale very well.
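
Seen from the web server's side, the mechanism described above might look roughly like the following Python sketch. It is only an illustration under simplifying assumptions: the function name run_cgi and its parameters are made up for this example, only a handful of the CGI variables are set, and a real server does much more (timeouts, status handling, streaming).

```python
import os
import re
import subprocess


def run_cgi(argv, method="GET", query_string="", body=b"",
            content_type="application/x-www-form-urlencoded"):
    """Run one CGI program for one request and return (headers, body)."""
    # Request metadata travels in environment variables (names from RFC 3875).
    env = dict(os.environ)
    env.update({
        "GATEWAY_INTERFACE": "CGI/1.1",
        "REQUEST_METHOD": method,
        "QUERY_STRING": query_string,
        "CONTENT_LENGTH": str(len(body)),
        "CONTENT_TYPE": content_type,
    })
    # A fresh process per request: the request body goes to stdin,
    # the response is whatever the program writes to stdout.
    result = subprocess.run(argv, input=body, env=env,
                            stdout=subprocess.PIPE, check=True)
    # The header block ends at the first empty line.
    parts = re.split(rb"\r?\n\r?\n", result.stdout, maxsplit=1)
    response_body = parts[1] if len(parts) > 1 else b""
    headers = {}
    for line in parts[0].decode("latin-1").splitlines():
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers, response_body
```

Every call to run_cgi starts and tears down a whole process for a single request, which is exactly the cost that makes plain CGI slow under load.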

Such setups were easy to recognize because these applications usually lived in the /cgi-bin/ directory, which automated scanning tools still probe today to see if they can find anything interesting there.

And when I said arbitrary executable, I meant it. Even a shell script can be the basis of a dynamic web page (if you are brave enough to parse query strings and multipart requests in a shell script):

#!/bin/sh

echo "Content-Type: text/plain"
echo
echo "Hello World!"

echo
echo "Environment:"
env

echo
echo "Input:"
cat -
echo

If you call this endpoint, the following data will be returned:

$ curl -d'foo=bar' 'http://127.0.0.1:8081/cgi-bin/test.sh?foo=bar'
Hello World!

Environment:
CONTENT_TYPE=application/x-www-form-urlencoded
GATEWAY_INTERFACE=CGI/1.1
REMOTE_ADDR=192.168.16.1
SHLVL=1
QUERY_STRING=foo=bar
HTTP_USER_AGENT=curl/7.88.1
DOCUMENT_ROOT=/usr/local/apache2/htdocs
REMOTE_PORT=51282
HTTP_ACCEPT=*/*
SERVER_SIGNATURE=
CONTENT_LENGTH=7
CONTEXT_DOCUMENT_ROOT=/usr/local/apache2/cgi-bin/
SCRIPT_FILENAME=/usr/local/apache2/cgi-bin/test.sh
HTTP_HOST=127.0.0.1:8081
REQUEST_URI=/cgi-bin/test.sh?foo=bar
SERVER_SOFTWARE=Apache/2.4.58 (Unix)
REQUEST_SCHEME=http
PATH=/usr/local/apache2/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
SERVER_PROTOCOL=HTTP/1.1
REQUEST_METHOD=POST
SERVER_ADDR=192.168.16.2
SERVER_ADMIN=you@example.com
CONTEXT_PREFIX=/cgi-bin/
PWD=/usr/local/apache2/cgi-bin
SERVER_PORT=8081
SCRIPT_NAME=/cgi-bin/test.sh
SERVER_NAME=127.0.0.1

Input:
foo=bar

The names of the environment variables may look familiar: they have been adopted in many places, probably to ease the transition from CGI.
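
For something less painful than parsing query strings in a shell script, here is a minimal sketch of a CGI script in Python that turns the variables shown above into usable values. The helper names read_request and render are invented for this example, and only form-style query strings are handled, so treat it as an illustration rather than a complete implementation.

```python
#!/usr/bin/env python3
import os
import sys
from urllib.parse import parse_qs


def read_request(environ, stdin):
    """Collect the CGI request data into plain Python values."""
    # CONTENT_LENGTH may be missing or empty on requests without a body.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = stdin.read(length) if length > 0 else b""
    return {
        "method": environ.get("REQUEST_METHOD", "GET"),
        "query": parse_qs(environ.get("QUERY_STRING", "")),
        "body": body,
    }


def render(request):
    """Build a full CGI response: header block, empty line, body."""
    lines = [
        "Content-Type: text/plain",
        "",
        f"Method: {request['method']}",
        f"Query: {request['query']}",
        f"Body: {request['body'].decode('utf-8', 'replace')}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # The web server provides the environment and the request body on stdin.
    sys.stdout.write(render(read_request(os.environ, sys.stdin.buffer)))
```

Reading exactly CONTENT_LENGTH bytes, instead of reading until end of input, is the safer habit, since the server is not required to close standard input after the body.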

The possibilities are endless. I've thrown a few extra examples into the related GitHub repository (a compiled binary from C code? Why not!), but Perl was probably the real star of the cgi-bin directory back then:

#!/usr/bin/perl

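# The header block must be followed by an empty line.
print "Content-type: text/plain\n\nHello, World.\n";

print "\nEnvironment:\n";
foreach my $key (keys %ENV) {
    print "$key=$ENV{$key}\n";
}

# Read the request body from standard input.
print "\nInput:\n";
while (<STDIN>) {
    print;
}
print "\n";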

Then in 1995, PHP arrived. At first, it too ran as a CGI script:

#!/usr/bin/php82
<?php

print("Content-Type: text/plain\n\nHello World!\n");

print("\nEnvironment:\n");
var_dump($_SERVER);

print("\nInput:\n");
var_dump(file_get_contents('php://stdin'));

Here I was a bit surprised to find the correct data in the $_SERVER array and not in the $_ENV array. Maybe it's the newer PHP, maybe it's a configuration detail (possibly the variables_order setting in php.ini), or maybe I should have invoked it differently for CGI scripts; I don't know. But it doesn't really matter, because we could soon leave CGI behind.

Alternative solutions

FastCGI

Also around 1995, FastCGI was released, aiming to address the performance problems of CGI. Judging by the CGI::Fast package in Perl, it can work in several ways. A web server can start one or more instances of an FCGI process itself, handing each a listening socket in place of standard input, over which FCGI requests and responses are then exchanged. Alternatively, the FCGI process can run independently, with the web server talking to it via a Unix socket or a regular network socket. Either way, the web server converts the FCGI response to an HTTP response and you are done.

According to the protocol description, the web server can send multiple requests to a process at the same time, which the FCGI process can process in parallel if it supports it.

The advantage of this system is that it is easier to implement an FCGI server than an HTTP server (the original HTTP/1.0 RFC 1945 is 60 pages long, and the HTTP/1.1 RFC 2068 is 162 pages long). The disadvantage may be that there is a fairly trusting relationship between the web server and the FCGI server, so if someone else can accidentally talk to the FCGI server directly, it may not end well (for example, the FCGI server code may not handle malformed requests as well as the HTTP server, or it may be possible to bypass the authentication enforced by the web server this way).

As I mentioned, the FastCGI protocol is simpler than HTTP, so applications can implement it more easily. Just for fun, I quickly threw together a simple Python FCGI server that returns a response similar to that of our previous CGI scripts for any request.
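
The multiplexing mentioned above relies on FastCGI's framing: everything travels in records with a fixed 8-byte header that carries a request ID. The sketch below packs and unpacks such records in Python. It is written purely as an illustration (it is not the server from the repository), the function names are made up, and it covers only the framing, not the full request lifecycle.

```python
import struct

FCGI_VERSION_1 = 1
# A few record types from the FastCGI specification.
FCGI_BEGIN_REQUEST, FCGI_PARAMS, FCGI_STDIN, FCGI_STDOUT = 1, 4, 5, 6

# Header fields: version, type, requestId, contentLength, paddingLength,
# plus one reserved byte (the trailing "x").
HEADER = struct.Struct("!BBHHBx")


def encode_record(record_type, request_id, content=b""):
    """Frame `content` as a single FastCGI record."""
    if len(content) > 0xFFFF:
        raise ValueError("content must be split across several records")
    # Pad to a multiple of 8 bytes, as the specification recommends.
    padding = -len(content) % 8
    header = HEADER.pack(FCGI_VERSION_1, record_type, request_id,
                         len(content), padding)
    return header + content + b"\x00" * padding


def decode_records(data):
    """Yield (type, request_id, content) for each complete record in `data`."""
    offset = 0
    while offset < len(data):
        _version, record_type, request_id, length, padding = \
            HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        yield record_type, request_id, data[start:start + length]
        offset = start + length + padding
```

Because each record names its request ID, records belonging to different requests can be interleaved on one connection and sorted out again on the other side.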

mod_php

Around 1997, PHP 3 and the mod_php Apache module were released. At least based on my web archive findings, I concluded that mod_php came with PHP 3, but I'm not entirely sure. It's probably not that relevant to the story.

In the case of mod_php, the PHP interpreter runs inside the Apache process and executes the PHP files that way. The tighter integration has its advantages because you don't have to start a new process per request, but it has its drawbacks as well. The PHP interpreter still occupies memory even if the request is for a static file.

All in all, however, we can say it was quite successful, and to this day it remains a widely used way to run PHP code with an Apache web server.

Simple Common Gateway Interface

FastCGI proved not to be simple enough for everyone, so a new competitor, SCGI, was introduced around 2001. The SCGI protocol is much simpler, but a connection can carry only one request.

For comparison, I have written a simple little SCGI server in Python as well that works similarly to the FastCGI server.
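
To give a feel for how small the wire format is, here is a sketch of building and parsing an SCGI request in Python. It is an illustration written under simplifying assumptions (the whole request is already in memory, and the function names are invented), not the server from the repository.

```python
def build_scgi_request(headers, body=b""):
    """Encode headers and body as one SCGI request."""
    # CONTENT_LENGTH must come first, and SCGI=1 must be present.
    items = [("CONTENT_LENGTH", str(len(body))), ("SCGI", "1"), *headers.items()]
    payload = b"".join(
        f"{name}\x00{value}\x00".encode("latin-1") for name, value in items
    )
    # The headers are wrapped in a netstring: b"<length>:<payload>,".
    return str(len(payload)).encode("ascii") + b":" + payload + b"," + body


def parse_scgi_request(data):
    """Split a complete SCGI request into (headers, body)."""
    length_text, _, rest = data.partition(b":")
    length = int(length_text)
    payload, comma, body = rest[:length], rest[length:length + 1], rest[length + 1:]
    if comma != b",":
        raise ValueError("malformed netstring")
    # The payload is a flat run of NUL-terminated names and values.
    fields = payload.split(b"\x00")[:-1]
    headers = {
        name.decode("latin-1"): value.decode("latin-1")
        for name, value in zip(fields[0::2], fields[1::2])
    }
    return headers, body[:int(headers["CONTENT_LENGTH"])]
```

There are no record types, request IDs, or padding here: one netstring of headers, then the body, then the connection is done.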

FastCGI, second round

In 2010, about 15 years after the protocol's release, PHP gained a bundled FastCGI process manager in the form of FPM (FastCGI Process Manager).

Meanwhile, some people had grown tired of Apache being too slow (a recurring theme throughout our story) and brought us the Nginx web server back in 2004. Together, Nginx and PHP-FPM gave us a decent alternative to Apache and mod_php.

The world is changing

As time went by, more and more languages wanted to be web-compatible. In 2003 came Python with WSGI, and in 2004 Ruby on Rails was released, which could initially run as CGI, FastCGI, or later with mod_ruby. Then in 2007 came Rack, which is a similar interface to WSGI for Ruby.

Roughly, it works like this: the HTTP request data arrives from somewhere (a web server written in the language, CGI, FCGI, whatever), gets transformed into a unified structure defined by the language's web interface, and that structure is then handed to the application.

For Python, for example, this might look like this:

HTTP request -> Gunicorn -> WSGI environment -> Flask -> the code we wrote
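
Stripped of the framework layer, the application end of that chain could be sketched as a bare WSGI callable. The name application is a common convention rather than a requirement, and the response text is made up to mirror the earlier CGI examples.

```python
def application(environ, start_response):
    """A WSGI application: a callable taking the environment and a callback."""
    # The environ dict reuses the CGI variable names (REQUEST_METHOD, ...).
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length)
    text = (
        "Hello World!\n"
        f"Method: {environ['REQUEST_METHOD']}\n"
        f"Query: {environ.get('QUERY_STRING', '')}\n"
        f"Input: {body.decode('utf-8', 'replace')}\n"
    )
    payload = text.encode("utf-8")
    # Status and headers go through the callback, the body is returned
    # as an iterable of byte strings.
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]
```

Saved as app.py (a hypothetical file name), this could be served with gunicorn app:application, and the same callable would work unchanged behind any other WSGI server.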

For Ruby, something like this:

HTTP request -> Unicorn -> Rack environment -> Sinatra -> the code we wrote

While in theory, the source of the HTTP request could be several things, in practice it seems that the web server written in the language of choice has been the winner. Interestingly, this is where we start to move away from the technologies that were previously invented. Why implement a complicated HTTP server when there is a simpler alternative? Wouldn't an FCGI or SCGI server have been enough? Who knows.

Modern web development

Around 2009, Node.js was released because, once again, someone didn't like that Apache was too slow and couldn't handle enough requests. Go was also released around that time. Both shipped with an HTTP server in their standard library, which I think determined how web applications would be developed with them.

The general solution was to write frameworks around the built-in HTTP server, and applications would use those frameworks, so each application became its own web server.
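
The same pattern can be sketched in Python, whose standard library also ships an HTTP server. This stands in for the Node.js and Go versions of the idea; the handler, the helper function, and the default port 8081 are arbitrary choices for this example.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """The application itself parses and answers HTTP requests."""

    def do_GET(self):
        payload = f"Hello World!\nPath: {self.path}\n".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, format, *args):
        # Keep the sketch quiet; a real application would log requests.
        pass


def make_app_server(host="127.0.0.1", port=8081):
    """Create the server; port 0 asks the OS for any free port."""
    return ThreadingHTTPServer((host, port), HelloHandler)
```

Calling make_app_server().serve_forever() is all it takes to run it: no web server configuration, no gateway protocol, just a process that speaks HTTP.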

Of course, the world has changed a lot in that time. Large applications have been split up into many small applications, where it has become increasingly rare to return full or partial HTML pages (so much so that rendering templates on the server side is a novelty for the newer generation), and so the needs have changed as well.

There are usually already some proxies in front of applications (HAProxy, Nginx, Traefik, and others) that deal mostly with HTTP requests, so it would just be an extra (probably unnecessary) moving part in the system to have another HTTP server in front of the application, whose only job is to translate from HTTP to, say, FCGI.

There's a good chance that optimization doesn't matter as much as it used to, either. You don't necessarily need an HTTP server written in C; a Python implementation may well provide the performance you need.

Summary

We've come a long way, and perhaps forgotten a lot along the journey, but the things mentioned above are still alive and well (or at least functional). You can even try them out with the related CGI playground repository. There may even be cases where they are worth using. It would be a shame to waste a Kubernetes cluster on a problem that a CGI script can solve without an issue.

deadlime