Web application performance and scalability (2005)


Key performance and scalability technologies

Now that you're familiar with the limited resources that are closely tied to a web application's performance and scalability, it's time to take a closer look at the key technologies involved in performance and scalability issues.

Hardware

Though CPU and memory -- discussed in the last chapter -- play a vital role in a web application's performance and scalability due to their limited nature, they are only two components among the slew of hardware-related choices that influence a web application's performance and scalability. The hardware-related topics discussed next are not so much limited resources as they are design choices, which is why I describe them as key technologies.

Unless you're charged with managing the data center where a web application resides, you won't need to tinker with many of the intricacies involved in using the following hardware technologies. Nevertheless, most service providers -- including cloud computing providers -- either rely on these technologies to build their core services or offer them as options, which is why it's important to familiarize yourself with them.

Virtualization

Virtualization is a technology that divides a single piece of hardware or box into multiple instances. Prior to virtualization, hardware and an operating system (OS) were one and the same. This made assigning resources to an OS inflexible, since a single OS needed to use all the resources present in a box. Figure 3-1 illustrates the basic scheme of using virtualization.

Virtualization partitioning
Figure 3-1 - Virtualization used to partition hardware into various nodes.

As you can see in this last figure, a single box can host various instances or nodes, each with its own resources, such as physical memory, CPU, permanent storage space (I/O capacity), network address (bandwidth) and other such resources. This makes it possible to manage a web application's resources more efficiently.

Note Jails, Containers and Zones - Related Virtualization techniques

Though the term virtualization started being used in the 1960's, a series of Unix OSs offer similar mechanisms under different names. Unix OSs of the BSD kind offer 'jails', whereas the Solaris Unix OS offers 'Containers/Zones'.

As you can infer from these names, the purpose of these approaches is similar to virtualization: to isolate portions of a box to run multiple OSs. In many cases, these approaches are still used on large boxes for operational concerns like security and management.

However, even though these virtualization techniques are effective, the evolving nature of the virtualization industry has made these last approaches look rudimentary compared to more recent virtualization products.

For example, a web application hosted on a relatively small node can work without the cost penalties of using an entire box and still be able to scale by re-assigning resources from one node to another. In a similar fashion, unused resources in a box can be moved to another node for better use. This resource assignment process is clear-cut since it's virtual. In other words, there is no need to re-install a web application's components or deal with the often uncertain process of migrating a web application to another host, which often implies re-installing a fresh OS. The way virtualization achieves this is by using a hypervisor.

A hypervisor is a host operating system that manages a box's resources for several guest operating systems or nodes. Therefore, when you deploy a web application with a provider that relies on virtualization, what you're given is actually a guest operating system managed by a hypervisor. This is why it's often so easy to scale and assign more resources to a web application with a click of a mouse. Instead of re-installing or migrating a web application to another box, the underlying hypervisor shifts resources from one guest OS or node to another.
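
To make the hypervisor/guest relationship more tangible, the following is a minimal sketch -- not taken from any particular provider -- that uses the libvirt Python bindings (assuming a libvirt-managed hypervisor such as Xen or KVM, and that the connection URI matches your setup) to list the guest nodes a hypervisor manages and the resources currently assigned to each:

    import libvirt

    # Connect to the local hypervisor; the URI is an assumption and varies by setup
    conn = libvirt.open("qemu:///system")

    for dom in conn.listAllDomains():
        # info() returns (state, max memory KiB, memory KiB, vCPUs, CPU time ns)
        state, max_mem_kib, mem_kib, vcpus, cpu_time = dom.info()
        print("%s: %d vCPU(s), %d MiB assigned" % (dom.name(), vcpus, max_mem_kib // 1024))

    conn.close()

Shifting resources between guests boils down to the hypervisor updating figures like these, which is why providers can expose the operation as a simple control-panel action.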

Needless to say, with virtualization having emerged in the 1960's, there are many hypervisors from which to choose. Among the most popular you will find Xen and VirtualBox -- both of which are open-source projects -- as well as commercial offerings by companies like VMware and Nimbus Data Systems, among other niche players.

Though virtualization might appear to be without fault, it does have certain disadvantages. For instance, a hardware box's configuration is not visible to node users. A hardware box could be near its overall hardware limit, in which case assigning more resources to a node could be limited, even requiring migration to another box. Another problem that can emerge is poorly virtualized resources such as I/O capacity -- where multiple nodes in the same box attempt to read/write from the same set of hard drives -- resulting in increased latency. The sidebar has more details on the topic of virtualization and I/O capacity.

Many of these last virtualization disadvantages are at the heart of vertical and horizontal scaling strategies, topics discussed in the next chapter.

Note Virtualization and I/O capacity

Even though managing CPU & memory has been a mature technique in hypervisors for some time, other areas like I/O capacity have been slower to mature. This can have adverse consequences when the node/box ratio is high, since many nodes end up concurrently accessing the same hard drives with no way to fairly distribute the load.

There are a series of tests you can do to determine if a node suffers from I/O contention, some of which are discussed in the upcoming chapters. It's also worth mentioning the steady progress made by niche virtualization players like Xsigo, which are addressing this specific problem. If you're interested in how virtualization has evolved and continues to mature, you might find 10 Years of Virtual Machine Performance (Semi) Demystified an interesting read.

Redundant Array of Inexpensive Disks - RAID

RAID is a technology designed to increase a system's fault tolerance to hardware failures -- specifically hard drive failures -- as well as increase a web application's data throughput. RAID consists of using an array of hard drives. If a hardware failure occurs on any member of an array, access to its data goes uninterrupted -- essentially, a production system continues as if nothing happened (i.e. real-time data recovery). Equally, by distributing an application's data among the members of an array, data throughput increases by allowing reads/writes on multiple locations. Figure 3-2 illustrates the principles of using RAID.

RAID architecture
Figure 3-2 - RAID used for fault-tolerance/real-time data recovery & parallel data access.

There are various ways to configure RAID. The first implementation detail has to do with levels. RAID levels -- represented as numbers -- indicate different degrees of fault tolerance and replication. Generally speaking, the higher a RAID level, the more robust a RAID array is and as a result the more expensive it will be. Table 3-1 describes the most common RAID levels.

Table 3-1 Most common RAID levels, description & key benefits
Level* | Name | Description | Key benefit(s)
RAID 0** | Striping with no parity | Consists of dividing files and saving them to multiple members of an array, a process known as striping. RAID 0 provides block striping. | Results in increased throughput, since files are written and read to multiple members of an array.
RAID 1 | Mirroring | Consists of mirroring (i.e. copying) files to multiple members of an array. | Results in increased fault tolerance, since files are duplicated across members of an array. Results in increased throughput, since files can be read off multiple members of an array.
RAID 3/4 | Striping with parity | Consists of dividing files and saving them to multiple members of an array, in addition to creating parity data capable of restoring any file in an array. Parity data is stored on a single member of an array. RAID 3 provides byte-level parity; RAID 4 provides block-level parity. | Results in increased fault tolerance over RAID 1, since files can be re-built with parity data.
RAID 5 | Striping with distributed parity | Consists of dividing files and saving them to multiple members of an array, in addition to creating parity data capable of restoring any file in an array. Parity data is distributed to all members of an array. RAID 5 provides block-level parity. | Results in increased fault tolerance over RAID 3/4, since parity data is distributed to all members of an array, thus there is no single point of failure.
RAID 10 or 1+0 | Striping with distributed parity and mirroring | Consists of dividing files and saving them to multiple members of an array, mirroring them, in addition to creating parity data capable of restoring any file in an array. Parity data is distributed among all mirrored members of an array. | Results in increased throughput, since files can be read off multiple members of an array. Results in increased fault tolerance over RAID 5, since parity data is mirrored and distributed to all members of an array.
* There are other RAID levels not included in this table (e.g. RAID 2 and 6) offering different functionality, but they are not as widely used.
** This level is often not catalogued as RAID, since it provides no fault tolerance (no parity or mirroring). But since it provides higher throughput through striping, the RAID denomination is often used.
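
To give a feel for how parity makes the rebuilds described in table 3-1 possible, here is a toy illustration -- in Python, and in no way a real RAID driver -- of the XOR parity idea behind RAID levels 3/4/5:

    def xor_blocks(a, b):
        # XOR two equally sized blocks byte by byte
        return bytes(x ^ y for x, y in zip(a, b))

    # Data striped across three (simulated) drives, plus a parity block
    block1, block2, block3 = b"AAAA", b"BBBB", b"CCCC"
    parity = xor_blocks(xor_blocks(block1, block2), block3)

    # Simulate losing the drive holding block2 and rebuilding it from the rest
    rebuilt = xor_blocks(xor_blocks(block1, block3), parity)
    assert rebuilt == block2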

In addition to levels, RAID can also be implemented in either software or hardware. Hardware RAID relies on a RAID controller -- a physical piece of equipment attached to an array of hard drives. Software RAID relies on a host OS, which executes a series of routines to do the work on an array of hard drives.

In terms of performance and scalability, software RAID is more resource intensive, since a host OS requires CPU cycles and memory to do RAID operations. Hardware RAID, on the other hand -- relying on a controller -- takes few, if any, resources away from a host OS.

In addition to performance and scalability concerns, another important factor in RAID design is price. Since hardware RAID requires a dedicated controller, it's far more expensive than software RAID solutions. Software RAID is set up on standard hard drives -- no special hardware required.

Another factor to consider is configuration. Software RAID is prone to bad configurations or OS errors that can render RAID data useless. Hardware RAID, relying on a RAID controller's micro-code provided by a manufacturer, is less likely to suffer these glitches. It's also worth mentioning that software RAID is generally limited to RAID 0, RAID 1 and RAID 5. More advanced RAID strategies are the realm of hardware RAID, supported only through RAID controllers.

Finally, like virtualization technology, which operates 'under the hood' of an application, RAID also operates transparently to an application. To an OS using an array of hard drives configured with either hardware RAID or software RAID, operations are performed against a volume.

This means that an application reading or writing data doesn't have to figure out which set of disks it has to operate on; it simply works on a single volume (e.g. hdc), just as it would when reading or writing data from a single hard drive (e.g. hdc). In both cases, all an application sees is a single access point.

However, given that in a volume's case an application's operations are actually performed against an array of disks, the application gains increased data throughput and fault tolerance thanks to RAID's characteristics (i.e. mirroring and striping data). Which disks the mirroring or striping falls on depends on sub-routines provided either by an OS's volume manager -- in the case of software RAID -- or by a controller -- in the case of hardware RAID.

Now that you're familiar with the principles of RAID, I can elaborate on a few more concepts related to performance and scalability in the context of hardware.

Direct Attached Storage

Direct Attached Storage or DAS is a term used to describe a storage device -- generally a hard drive -- connected to a box through a host bus adapter (HBA). If you've ever installed a storage device on any type of computer -- be it workstation or server -- then you've already worked with DAS.

There are many types of HBA, which makes DAS vary in the throughput and I/O capacity it can offer. Among the more common HBA/DAS protocols are: Universal Serial Bus (USB), Advanced Technology Attachment (ATA), Serial ATA (SATA), External Connectivity SATA (eSATA), Small Computer System Interface (SCSI), Serial Attached SCSI (SAS) and Fibre Channel. Generally speaking, the protocols toward the end of this list have higher throughput and I/O capacity -- though they are the most expensive -- while those at the beginning have lower throughput and I/O capacity -- though they are the most affordable.

All HBA/DAS protocols by specification have a maximum bandwidth or throughput at which they're capable of transferring data. Table 3-2 has some of these values.

Table 3-2 HBA/DAS protocol maximum bandwidth/throughput
Protocol | Maximum bandwidth
USB | 0.2, 1.5, 60 & 500 MiB/s
ATA | 16, 33, 66, 100 & 133 MiB/s
SATA | 150 MiB/s, 300 MiB/s, 1.5 GiB/s & 3 GiB/s
SCSI | 640 MiB/s, 3 GiB/s & 6 GiB/s
Fibre Channel | 800 MiB/s, 5 GiB/s

On the matter of I/O capacity, values can vary substantially by manufacturer, HBA/DAS protocol and the usage intended for a storage device (e.g. serving static files with a web server, read operations with an RDBMS, write operations with an RDBMS, etc). Though as a rule of thumb, the higher an HBA/DAS protocol's maximum bandwidth, the higher its I/O capacity.
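
As a rough first look at the kind of sequential throughput a drive delivers -- nowhere near the rigor of the benchmarks covered later in the book, since it ignores OS caching, queue depth and access patterns -- you could time a plain sequential read in Python; the file path here is purely illustrative:

    import time

    CHUNK = 1024 * 1024                       # read in 1 MiB chunks
    total_bytes = 0
    start = time.time()

    with open("/path/to/large/testfile", "rb") as f:   # hypothetical large test file
        while True:
            data = f.read(CHUNK)
            if not data:
                break
            total_bytes += len(data)

    elapsed = time.time() - start
    print("%.1f MiB/s" % (total_bytes / (1024.0 * 1024.0) / elapsed))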

In another chapter of the book I will illustrate how to do a series of I/O capacity benchmarks on your own or a provider's hardware. Another meaningful source of information about hard drives, HBA/DAS protocols and I/O capacity is Tom's Hardware, which publishes a series of benchmarks on these topics, such as its Enterprise Hard Drive Charts with 15 benchmarks for multiple manufacturers.

Next, I will describe a topic closely related to DAS.

Network Attached Storage

Even though relying on DAS is the first option for storing a web application's data, as a web application and its parts grow, demand for storage space can grow into the Tebibyte & Pebibyte range. Depending on the hardware box, this makes it difficult -- and oftentimes impossible -- to increase the amount of DAS on account of limitations related to an HBA.

In order to ease managing these larger amounts of data and avoid any hardware limitations related to a HBA, a web application's data storage is often delegated to a separate box that can communicate with a host box as if it were directly attached. This is Network Attached Storage or NAS.

NAS is a hardware box -- often called an appliance -- with all the components & resources you would come to expect of a box (e.g. RAM, CPU, bandwidth), a large part of which are hard drives. Communication between NAS and a host box takes place through network file system protocols.

Network file system protocols are less efficient than their HBA/DAS counterparts. Among the most popular network file system protocols you will find the Network File System (NFS) protocol -- for Unix/Linux file systems -- and the Server Message Block (SMB) protocol -- for Windows file systems. These protocols allow a host node to access files as if they were on directly attached hard drives, when in fact they reside on hard drives located across a network in another box. Figure 3-3 illustrates the communication between NAS and a host box.

NAS communication
Figure 3-3 - NAS communication.

As you can see in this last figure, the hard drives present in NAS -- including its files -- all seem locally mounted on a host box or its nodes. A NAS therefore -- just like DAS -- can use any pertinent strategy, like RAID to increase fault tolerance or data throughput, as deemed necessary. In essence, NAS serves as a storage extension for a box or node when DAS becomes unfeasible due to costs or hardware limitations (i.e. HBA limits).
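
If you're curious whether a given path on a node is actually backed by NAS, a quick check -- a minimal sketch assuming Linux and its /proc/mounts listing -- is to look for network file system types among the mounted file systems; to the applications above, these mounts look no different from local disks:

    NETWORK_FS = {"nfs", "nfs4", "cifs", "smbfs"}   # common network file system types

    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mountpoint, fstype = line.split()[:3]
            if fstype in NETWORK_FS:
                print("%s mounted at %s over %s" % (device, mountpoint, fstype))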

NAS solutions come in many forms. For example, it's possible to convert a standard box or node to fulfill the duties of NAS. This is achievable with software like openfiler or other distributions like FreeNAS and Turnkey Linux, which, combined with RAID technology, can create fairly robust NAS solutions. On the other hand, there are also turn-key NAS solutions like HP's NAS Systems or EMC's Celerra, which are appliances that come pre-installed with a series of features (e.g. hardware RAID, administrative software, backup/tape facilities).

As mentioned earlier, unless you deal with managing the data center that hosts your web application, you probably won't have to deal with the details involved in using NAS; nevertheless, it's likely a provider uses NAS to support some part of your web application's production environment. For cases where NAS isn't used, another variation that is often employed is a SAN.

Storage Area Network

A Storage Area Network or SAN represents a more sophisticated storage solution than NAS. Unlike NAS, which uses network file system protocols to set up communication between a host node and NAS, a SAN relies on protocols like SCSI or Fibre Channel, which as you might recall are precisely the protocols used by DAS. So why the need for the intermediate NAS? Why not go from DAS to SAN, given they use the same protocols? Cost.

Since a SAN relies on DAS protocols, it's essentially a box -- often called an appliance -- with multiple HBAs that is capable of delivering storage capacities beyond those of DAS, but without the throughput and I/O capacity sacrifices of going through network file system protocols. A SAN is therefore considered a high-end storage solution, since it can replicate the behaviour of dozens or hundreds of hard drives as if they were directly attached to a host.

You may still be left wondering: if both NAS and SAN are individual boxes (i.e. appliances), what is the difference? NAS is interconnected through network cables -- since it relies on network file system protocols -- which not only makes it less expensive, but also easier to cobble together different boxes with storage devices, albeit all with a lesser efficiency than DAS. A SAN, on the other hand, is interconnected through HBAs -- just like DAS -- which not only makes it more expensive, but also more difficult to put together, albeit with the same efficiency as DAS.

In practice though, you will find that NAS/SAN hybrid appliances and architectures are becoming common. This allows both DAS-type protocol support -- for high-performance applications -- as well as network file system protocol support, without incurring the cost penalties of DAS, all while offering storage capacities beyond those of DAS.

Another difference often associated between NAS and SAN solutions is in relation to storage type, the sidebar has more information on this topic.

Note File-based vs. Block-based storage

Applications store data in files to make data management user friendly. In order to store files, you format hard drives with a file system. This results in file-based storage.

An application's data, whether it belongs to a web server, a permanent storage system or is static in nature, is easily managed due to the inherent human readability of files. In addition, since data is in files, it is accessible via a network file system protocol, which means NAS storage technology is suitable over the more expensive SAN storage technology.

However, even though file-based storage is prevalent in most applications, there is another alternative called block-based storage. In block-based storage, you store data in blocks, a format closer to how hard drive technology actually works.

Block-based storage is less used because it's not as user-friendly, even though it's more efficient. The efficiency stems from the fact that hard drives actually read and write data in blocks. Files are just abstractions over a series of blocks on a hard drive, which makes reading files more elaborate than reading data blocks. The lack of user friendliness is due to block-based storage lacking human-readable conventions; in other words, data is just composed of blocks (e.g. block 1, block 2, block 3, etc.) and it's up to an application to pin-point the data's location.

For this reason, block-based storage is mostly used in scenarios where performance is critical. Many permanent storage systems (e.g. RDBMS) are configurable to work on block-based storage -- also often called raw devices -- to enhance performance. Another scenario can involve data backups, a process which often relies on block-based storage to ease the transfer of large amounts of data.
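
As a tiny sketch of what block-style access looks like from a program -- the device path is illustrative and reading raw devices normally requires elevated privileges -- the application addresses blocks by offset instead of opening files:

    import os

    BLOCK_SIZE = 4096
    fd = os.open("/dev/sdb", os.O_RDONLY)        # hypothetical raw device
    os.lseek(fd, 10 * BLOCK_SIZE, os.SEEK_SET)   # jump straight to block 10
    block = os.read(fd, BLOCK_SIZE)              # it's up to the application to know
    os.close(fd)                                 # what this block of bytes means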

Due to the nature of block-based storage (i.e. its lack of files), it can only take place on DAS/HBA hardware or SAN devices. As a result, block-based storage is mostly found on high-end permanent storage systems and backup systems. In addition, you should also note that an application must be capable of reading and writing blocks to take advantage of the performance benefits of block-based storage, which is yet another reason block-based storage is not so prevalent.

Finally, it's worth clarifying that you should not equate block-based storage with SAN and file-based storage with NAS. While block-based storage does need a SAN-type architecture, SAN-type devices can also be used for file-based storage. In this last scenario, file-based storage gains from the greater data transfer speeds associated with SAN devices -- compared to NAS -- albeit not to the fullest extent possible using block-based storage. However, given that block-based storage is less user friendly, and that applications must be specifically written to work with data stored in blocks, SANs using file-based storage to support static files, web servers and even medium-sized permanent storage systems are common.

Clusters and the Beowulf project

Just as virtualization technology allows partitioning a hardware box into multiple nodes to efficiently use its resources, a cluster is the pooling of multiple hardware boxes or nodes for the purpose of attaining higher performance and scalability levels.

Clusters use a myriad of technologies, which include high-end storage hardware and high-performance CPUs, as well as operating systems and application software especially designed to take advantage of a cluster's design and resources. However, relying on some of these cluster technologies can take a heavy toll on a web application's costs and maintenance.

In order to cut the reliance on specialized hardware to create clusters and yet still be able to leverage large quantities of computing resources, many initiatives have emerged. One of the most important ones in this realm is the Beowulf project.

The Beowulf project aims to establish the blueprints necessary to build scalable performance clusters based on commodity off-the-shelf (COTS) hardware, using a private network with open source software (e.g. Linux). Such a design substantially reduces the need to use specialized equipment and as a result the overall cost of deploying a high performance and scalable web application.

It should also be noted that the Beowulf project is not a product per se. As such, there are many Beowulf vendors, as well as books on how to engineer a Beowulf-style cluster.

A Beowulf cluster can be as simple as two networked computers each running Linux and sharing a file system, or as complex as 1024 nodes with low-latency connectivity (e.g. Fibre Channel, as used for DAS or SAN). It's important to realize that unlike the nodes used to describe virtualization, a cluster's nodes -- whether categorized as Beowulf or not -- are aware of and work with one another. This last fact is vital to understanding the concept of clusters, as illustrated in figure 3-4.

Cluster architecture
Figure 3-4 - Cluster architecture.

As you can see, in addition to nodes being interconnected by means of hardware -- network cables or host bus adapters (HBAs) -- there is also a software layer on each node which makes them work together, something that makes clusters a centerpiece for enabling distributed computing applications. Most software technologies that are proven to deliver high levels of performance and scalability in web applications work under this principle.

Note Cloud computing services, clusters & custom software

Most cloud computing services run as clusters. This is what allows applications running on them instant access to vast pools of resources like bandwidth, CPU, memory and storage, all without requiring changes to an application's underlying structure. However, there is a catch to obtaining the benefits of running an application in a cluster or cloud computing service: custom software.

The nature and complexity of this software layer that makes cluster nodes work together can vary greatly depending on the provider.

For example, many cluster versions of RDBMS software differ from their non-cluster counterparts only in the way they're administered and deployed. This favors application designers, since it doesn't require learning new techniques or re-writing large parts of an application to use a clustered RDBMS version.

On the other hand, certain clustered environments and cloud computing services leave this software layer to application designers. In such cases, application designers are required to learn new techniques and incorporate them into an application's design to take advantage of a cluster.

For the moment, having this basic understanding of clusters should suffice. The next chapter goes deeper into the topic of clusters in the section on horizontal scaling, while the chapter that follows introduces you to the software layer used by cloud computing providers. In addition, other parts of the book also touch on the topic of clusters to describe in-depth topics related to technologies that depend on this architecture.

Next, I will switch gears over to another broad topic related to performance and scalability in web applications.

Programming languages and web frameworks

You can create web applications in many programming languages nowadays. Some languages are more popular than others for a number of reasons. Some form part of college curriculums, granting them an ample user base; others are chosen based on corporate policies; and oftentimes they are chosen just because they are the latest novelty or language 'du jour'.

Even within programming languages, there are often many options from which to choose to create web applications. Such options, often referred to as 'web frameworks', are also chosen based on an ample set of criteria, ranging from the simple (e.g. 'it's the most stable') to the complex (e.g. 'it's the best option for building web applications that integrate with commercial CRMs').

Selecting programming languages and web frameworks solely on performance and scalability criteria is rare, not only because performance and scalability are future-oriented metrics, but also because they are contentious topics when pitted in the form of programming language X vs. programming language Y or web framework A vs. web framework B. I will explain why this is so.

More often than not, the criterion that carries the most weight in choosing one language or web framework over another boils down to a single word: productivity. The word productivity takes on a life of its own in business. As a result, productivity trumps many other factors in web applications, chief among them performance and scalability. You needn't look further than your own skill set.

If you were to take 10 times as long developing an application in a programming language or web framework you knew little about, but which had 10 times better performance and scalability, would you switch? How about one with 20 times better performance and scalability, but that would still take you 10 times as long to develop? Surely you would ponder your answer. Is investing 10 times the effort worth it? Can't I gain in performance and scalability without hurting my productivity to such a degree?

The importance given to productivity stems from it being a metric measured in real time. Daily, weekly or monthly milestones are very objective metrics for measuring the output of a development team, as are years of experience in a particular programming language. The accumulated dexterity of a development team can quickly be lost by simply changing a web framework, even if using the same programming language.

Productivity can be and is measured against the bottom line of a project on a constant basis, whereas performance and scalability are not. As I've already mentioned, performance and scalability issues don't appear until a web application has attained considerable demand, which can often take months or years, if it happens at all.

As a result, when it comes to choosing a programming language or web framework, most application stakeholders err on the side of addressing pressing performance and scalability issues once demand has built up, instead of shifting resources to a 'better' programming language or web framework said to have 'better' performance and scalability qualities that may or may not eventually be needed.

I used quotes for the term 'better' in the previous paragraph because, when it comes to comparing performance and scalability issues between programming languages and web frameworks, it's a contentious topic. In this case, I will refer you not to your own skill set, but rather to a series of sites which are among the most visited and use a variety of programming languages to develop their web applications. Table 3-3 has the programming languages and web frameworks which are publicly known to back such sites and the average number of monthly visitors.

Table 3-3 Programming languages & frameworks used on highly visited applications.
Site | Programming language | Visitors per month | Visitor source | Technology source
youtube.com | Python | 82.8 million (U.S. only) | http://www.quantcast.com/youtube.com | Seattle Conference on Scalability: YouTube Scalability
wikipedia.org | PHP | 69.3 million (U.S. only) | http://www.quantcast.com/wikipedia.org | Wikipedia Software and Hardware
scribd.com | Ruby | 28.7 million (Global) | http://www.quantcast.com/scribd.com | Scaling Scribd
digg.com | PHP | 27.0 million (Global) | http://www.quantcast.com/digg.com | How Digg Works
linkedin.com | Java | 22.5 million (Global) | http://www.quantcast.com/linkedin.com | LinkedIn - A Professional Network built with Java Technologies and Agile Practices
typepad.com | Perl | 15.1 million (Global) | http://www.quantcast.com/typepad.com | How Typepad changed their architecture without taking down the service (Changing your tires at 100mph - Scaling Typepad Sixapart)
plentyoffish.com | .NET | 5.2 million (U.S. only) | http://www.quantcast.com/plentyoffish.com | ARCast.TV - PlentyOfFish.com How one man beat the big guys
apartments.com | .NET | 2.2 million (U.S. only) | http://www.quantcast.com/apartments.com | Microsoft Case Studies apartments.com

Table 3-3 has a series of sites tilted toward specific programming languages for developing their applications. In fact, larger sites -- such as amazon.com or ebay.com -- often rely on a mix of programming languages to put together their applications. But as you can attest from the data presented in this last table, practically any programming language can perform and scale at levels beyond 1 million visitors per month.

Of course, there is more to achieving these levels of performance and scalability than a programming language. But it should be clear that whatever your choice, a programming language per se will not be the deciding factor that prevents you from achieving your performance and scaling goals.

Considering the information presented in table 3-3, and that working with a programming language consumes a great part of an application's development time, it's sound advice to rely on the programming language you're most productive with, incorporating performance and scalability techniques as they apply to that language, while relying on other web application design techniques -- beyond the programming language -- to boost overall performance and scalability.

In Part III of this book I will describe performance and scalability best practices for a series of programming languages used in web applications. And since every web framework's purpose is to simplify the usage of a particular programming language in the context of web applications, the information in Part III will serve you no matter what web framework you choose.

Next, I will introduce you to a series of web frameworks that have grown in popularity in recent times and which you're likely to have used or at least heard about. I have found them widely adopted given the productivity gains they offer, so I feel it's important to get in a word about them, performance and scalability wise.

Scaffolding saving web frameworks

Patterns are a common occurrence in software and they are no exception to the realm of web applications. Irrespective of the language or framework you choose, nearly all web applications have recurring patterns.

Such patterns are in fact so common that many web frameworks support them in one way or another. These patterns run the gamut from the use of HTML templates to favor reuse, to web controllers to manage a web application's growth effectively, to more elaborate techniques like dependency injection or Object Relational Mapping (ORM) to ease queries against an RDBMS.

Though all web frameworks attempt to cut the amount of work needed to complete an application by incorporating such patterns, the emergence of the Ruby on Rails framework set a precedent in this space.

Based on the Ruby language, Ruby on Rails struck a chord among many people due to the productivity levels it offered. This productivity came in the way of reducing the amount of scaffolding needed to produce a web application. By scaffolding, understand all the code, configuration files and setup procedures that are by-products needed to create a web application -- which is to say, time spent doing work that isn't reflected directly in a web application, all of which comes at the expense of productivity.

The way Ruby on Rails achieves this productivity level is by making a series of assumptions when creating an application, generating the necessary scaffolding when you perform one of a series of primary actions.

For example, assume you're to design a web application for baseball cards. Upon defining a baseball card, Ruby on Rails generates an HTML template, a web controller for defining actions and a model class for persisting baseball cards to a database via ORM, as well as linking everything together so the web application is accessible instantly from a browser. Once the framework generates this scaffolding, you can dedicate yourself to customizing the web application's business logic and content, which is where the most productive time is reflected. Though this entire process comes at the cost of assuming a series of details, for most cases it saves a considerable amount of time.

But what if you don't use Ruby? As it turns out, Ruby on Rails' design was such a success that many other programming languages have implemented similar frameworks. Following in similar steps you will find Cake for PHP, Pluf for PHP, Django for Python, Roo for Java, Grails for Groovy -- the last of which is a Java-compatible language -- and MonoRail for .NET.
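
To make the idea concrete outside Ruby, here is a minimal sketch of the baseball card example using Django, the Python entry in this list; the model and field names are hypothetical, and Django's scaffolding takes the form of its auto-generated admin interface rather than Rails-style generators:

    # models.py -- defining the data once is enough for Django's ORM to persist it
    from django.db import models

    class BaseballCard(models.Model):
        player = models.CharField(max_length=100)
        team = models.CharField(max_length=100)
        year = models.IntegerField()

    # admin.py -- registering the model yields a ready-made CRUD interface
    from django.contrib import admin
    admin.site.register(BaseballCard)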

This means that whatever programming language you opt to use in your web applications, it's likely a web framework with the same 'scaffolding saving' characteristics as Ruby on Rails exists. Hence, in order to pick up the pace of the book and not get bogged down in too many low-level web application details, I will rely on some of these scaffolding saving frameworks to create a base application and illustrate several performance and scalability scenarios throughout the book.

But what if you really aren't interested in using these scaffolding saving web frameworks? Though I would agree there are many and often more sophisticated web frameworks for creating web applications, these scaffolding saving web frameworks are quite popular and easy to set up. So as long as you're familiar with the underlying programming language, you should have no problem following along and installing these web frameworks. I'm not trying to impose scaffolding saving frameworks on your web applications; I'm simply going to rely on them to illustrate several performance and scalability concepts without having to write too much code. The adjacent sidebar touches on the topics of performance and scalability as they relate directly to scaffolding saving web frameworks.

As I've already mentioned, bear in mind the programming language and web framework you choose will only be a fraction of the overall performance and scalability throughput of a web application. Nevertheless, I will strive to illustrate programming language techniques and avoid web framework specific topics when at all possible, so you can use this knowledge on any other web framework that targets your programming language of choice.

Note Do scaffolding saving frameworks scale ?

Scaffolding saving frameworks like Ruby on Rails -- and similar frameworks for other programming languages -- are often criticized because they are said to not scale. Many of the reasons though are often unfounded.

While it's true that scaffolding saving frameworks make assumptions and often simplify design choices to make things easier, many of the scaling problems attributed to these web frameworks are of a more general nature.

Things like web server tuning, permanent storage system tuning, run-time environment (e.g. Java) configuration, a monolithic business logic tier or sub-optimal code can all contribute to degrading performance and scalability, irrespective of using these types of frameworks.

So re-expressing what I mentioned earlier, the performance and scalability issues of web applications will rarely be attributed to a single factor, such as a programming language -- as illustrated in table 3-3 with multiple sites using diverse languages -- or the use of a scaffolding saving framework.

Web applications and data access

A web application constantly works with data. It will either save data to a permanent storage system for future reference, read data from a permanent storage system to present it to end users, delete data once it's no longer relevant, update data in said permanent storage system to reflect changes or manipulate it in some other way.

You will note that in this last paragraph I chose the wording 'permanent storage system'; this was intentional. Though in many web applications a permanent storage system often takes the form of a Relational Database Management System (RDBMS), as an application's performance and scalability demands increase, the volume and type of access operations performed by a web application can reach a level that makes this type of permanent storage system limited or inadequate.

I will elaborate on the different types of permanent storage systems in the next section; here I will concentrate on data access operations and how they have their own characteristics depending on an application's purpose.

Online Transaction Processing - OLTP

Most web applications are OLTPs, short for online transaction processing systems. A transaction in this context is just a unit of work. Work along the lines of an application creating a new record for a player, updating data for a pre-existing player, deleting data from a player's record or just reading a player's record for display by an application.

In web applications this data access practice is so common that a four-letter acronym is often used to describe it: CRUD, which stands for Create, Read, Update and Delete. In order to create, read, update or delete data from any permanent storage system, a programming language requires a way to interact with it. This process involves using a specific programming language API provided by a permanent storage system vendor. Generally speaking, permanent storage vendors provide APIs for most mainstream languages (e.g. Java, PHP, .NET, Python and Ruby). This means the only thing you're required to do to execute CRUD operations is intersperse whatever CRUD calls are appropriate into your application's code using the adequate API (e.g. createRecord("John Smith"), updateRecord(133,"Tom Jones"), etc).

In an effort to simplify the usage of a programming language API to do CRUD operations, many strategies are available, all of which can have important performance and scalability implications. Some of these strategies include: special design tools, intermediary languages (a.k.a. query languages), simpler APIs built on top of an original vendor's APIs, web frameworks tightly integrated with certain vendor APIs, among other approaches.

For the sake of choosing an example, I will discuss the default CRUD strategy used by some of the scaffolding saving web frameworks mentioned earlier. This strategy involves using a technique called Object Relational Mapping (ORM) and an RDBMS as the permanent storage system.

ORM is one of the more high-level strategies used for performing CRUD operations on an RDBMS -- and it's used by a wide array of web frameworks, in addition to the scaffolding saving frameworks I've already mentioned. When I use the word 'high-level', it's to imply that much of the work involved in performing CRUD operations using ORM is delegated to an Object Relational Mapper -- an API with pre-defined algorithms for performing CRUD operations.

In contrast, a 'lower-level' strategy for performing CRUD operations on an RDBMS would have you use Structured Query Language (SQL) -- the de facto language for performing CRUD operations on an RDBMS -- directly in a web application. By relying on ORM, you avoid mixing a web application's business logic with SQL, thus working at a 'higher level' -- with an Object Relational Mapper providing pre-defined SQL algorithms.
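
As an illustration of what this 'higher-level' style looks like in practice, here is a minimal sketch using SQLAlchemy -- a Python Object Relational Mapper chosen purely as an example, not a framework the book depends on -- where each ORM call stands in for SQL you would otherwise write by hand:

    from sqlalchemy import create_engine, Column, Integer, String
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()

    class Player(Base):
        __tablename__ = 'players'
        id = Column(Integer, primary_key=True)
        name = Column(String)

    engine = create_engine('sqlite:///example.db')   # illustrative SQLite database
    Base.metadata.create_all(engine)                 # CREATE TABLE players (...)
    session = sessionmaker(bind=engine)()

    session.add(Player(name='John Smith'))           # INSERT INTO players ...
    session.commit()
    player = session.query(Player).filter_by(name='John Smith').first()  # SELECT ...
    player.name = 'Tom Jones'                        # UPDATE players SET name = ...
    session.commit()
    session.delete(player)                           # DELETE FROM players WHERE ...
    session.commit()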

The actual approach you take to performing CRUD operations (e.g. programming language API, SQL, ORM or other) in your OLTP applications, will largely depend on what your permanent storage system supports and what fits your application's requirements best (e.g. the chosen framework, the complexity of CRUD operations).

Online Analytical Processing - OLAP

A web application designed for OLAP -- short for Online Analytical Processing -- uses a different set of data access patterns than one designed as an OLTP. Analytical processing systems are mostly concerned with data retrieval, in addition to dealing with vast amounts of data relative to OLTPs. Another characteristic of OLAPs is that they seldom make data updates, insertions or deletions, since analytical processing by its nature rarely requires these data operations.

OLAP applications are often referred to by a host of other names that can include data mining, batch processing and business intelligence. In all cases, the purpose is the same: to analyze large amounts of data and get insight from it. Since data access patterns share a particular form, certain types of permanent storage systems are often the preferred choice for backing OLAPs, a topic I will address in the upcoming section on permanent storage technologies.

However, similar to OLTPs, OLAPs access data using a specific programming language API provided by a permanent storage system vendor. Generally speaking, permanent storage vendors offer APIs for most mainstream languages (e.g. Java, PHP, .NET, Python and Ruby).

In an effort to simplify the usage of a programming language API to do analytical processing, a series of permanent storage vendors have established standards to ease executing analytical queries. Similar to how OLTPs have settled on the SQL standard or some 'higher-level' abstraction of it, OLAPs often rely on either XMLA (XML for Analysis) or MDX (Multidimensional Expressions) to execute their operations.

Though just like their counterparts, the actual approach you take to performing analytic operations (e.g. programming language API, XMLA, MDX or other) in your OLAP applications, will largely depend on what your permanent storage system supports and what fits your application's requirements best (e.g. the chosen framework, the complexity of the analytic operation).

Parallel OLTP and OLAP

For certain web applications, data access is so demanding it's made in parallel, something that by simple association -- and as explained in the chapter on fundamental concepts -- makes a system distributed. Parallel data access can involve both OLTPs and OLAPs, though OLAPs are a far more common scenario given that they work with larger amounts of data.

Parallel OLTP and OLAP creates a more elaborate scenario than having an ordinary OLTP or OLAP, since data is distributed over multiple locations or nodes to achieve parallelism. This also creates a different set of rules, for both the data access mechanism and the permanent storage system.

For parallel OLTP and OLAP systems -- just like ordinary OLTP and OLAP systems -- there is a preferred set of permanent storage technologies, which I will address in the upcoming section on permanent storage technologies. In the case of data access mechanisms, even though certain parallel OLTP and OLAP systems can rely on some of their regular counterparts' data access mechanisms (e.g. programming language APIs, SQL, XMLA), a data access mechanism specifically designed to address parallelism has emerged: MapReduce.

It's very important to understand that MapReduce is an access technology (i.e. just like an API), since a common misconception is that MapReduce is a permanent storage system. Even though MapReduce is closely related to permanent storage technologies, it's not one per se and can actually work with several permanent storage technologies.

Just as CRUD operations rely on an API to create, read, update and delete data, MapReduce consists of an API to map and reduce data. Since the verbs of mapping and reducing data are not as intuitive as creating, reading, updating and deleting data, MapReduce is best explained through an example.

The first step to easily managing a large data access operation would be to reduce the amount of data you're working with. If you guessed the Reduce in MapReduce has something to do with achieving this, you're on the right track. MapReduce consists of executing two operations on data: a Mapper operation, which consists of mapping data into a particular structure, whose output is then fed into a Reduce operation charged with reducing the data on the basis of a particular set of rules. Map and Reduce functions are open-ended, meaning you can Map data based on dates, characters, counts or some other arbitrary value, as well as Reduce data by applying rules based on such things as names, cities, countries or some other arbitrary value.

At a very simple level, this is all MapReduce is about: mapping and reducing data. Straightforward, right? Indeed, but don't lose sight of the fact that MapReduce has one particular characteristic in the way it does these operations: it can do them in parallel. The mapping of data can take place across several nodes, as well as the reducing of data, making it an expedited procedure -- relative to other mechanisms -- for accessing and processing vast amounts of data.
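
To ground the two operations, here is a minimal, single-process sketch of the MapReduce idea applied to counting words; the function names are illustrative, and a real framework such as Hadoop would run many mapper and reducer tasks in parallel across nodes:

    from collections import defaultdict

    def mapper(document):
        # Map: emit (key, value) pairs -- here, (word, 1) for every word seen
        for word in document.split():
            yield (word.lower(), 1)

    def reducer(word, counts):
        # Reduce: collapse all values for one key into a single result
        return (word, sum(counts))

    documents = ["the quick brown fox", "the lazy dog", "the quick dog"]

    # Shuffle: group intermediate values by key before reducing
    grouped = defaultdict(list)
    for doc in documents:                  # on a cluster, each document maps on its own node
        for word, count in mapper(doc):
            grouped[word].append(count)

    results = [reducer(word, counts) for word, counts in grouped.items()]
    print(sorted(results))                 # e.g. [('brown', 1), ('dog', 2), ('fox', 1), ...]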

But these mapper and reduce functions sound complicated to design, do they not? How will MapReduce know where to locate data? And that it isn't processed -- mapped or reduced -- twice? Or that data isn't lost by a node suddenly dying? No worries, MapReduce is designed from the ground up to deal with this type of distributed computation. The mapper and reduce functions are managed in such a way that their workload is distributed and executed in an orderly fashion, avoiding all the pitfalls of distributed systems.

So don't worry about dealing with the underlying complexities of processing data in parallel; there are several MapReduce implementations for several programming languages, like Hadoop for Java, Twister also for Java, Skynet for Ruby and Mincemeat for Python, among many others. Note that the first MapReduce implementation, produced by Google for C++ and simply called the 'MapReduce framework', is not available to the public at large, but rather used internally to support their services.

Similar to how other data access mechanisms have catered to 'higher-level' abstractions (e.g. ORM for SQL), MapReduce has also spawned its own share of 'higher-level' mechanisms that allow you to work above basic mapper and reduce functions; some of these approaches include Pig and Hive -- both of which are tightly pegged to the Hadoop implementation of MapReduce.

Finally, I will skip the details of how data is stored in order to be processed by MapReduce, since this is relative to permanent storage technologies, which I will address next.

Permanent storage systems

Information is constantly stored on a system for future reference, whether it's a text document containing a personal letter, a report with quarterly sales or a company's workforce records. Web applications are no different. Not only do web applications constantly collect and manage large amounts of information, but occasionally they also generate a lot of it in the form of navigation and buying patterns.

However, unlike a typical PC or workstation in which nearly all information is stored in files with a format extension to associate it with a particular program, web applications rely on different technologies to store information. These different technology choices are attributed to the type of information being stored, which in turn influences the best retrieval and storage approach (e.g. CRUD operations, analytical processing).

Throughout the years many technologies have emerged to store information for web applications. Some continue to set the norm in popularity, others have faded into obscurity relegated to niche areas, while other choices are still experiencing initial growth that has yet to prove sustainable.

But before embarking on the topic of permanent storage systems, I will describe a series of topics that strongly influence how permanent storage systems work and how they're often categorized: File systems and ACID properties.

File systems

In order to talk about permanent storage systems, it's necessary to introduce the lowest common denominator on which permanent storage systems rely: file systems. File systems are especially important to our discussion, since some of the permanent storage systems that have gained popularity in web applications depend on particular file systems for their operation. In addition, some file systems have certain limits, as well as characteristics that can make them better suited for certain circumstances.

File systems generally work behind the scenes in most applications. Except for savvy users and system administrators, file system types often go unnoticed. However, this doesn't mean there aren't many options to choose from, although a few options command most of the file system market.

Local file systems

An OS and all its applications permanently store data on some type of file system. A file system's task is simple: to manage all these files on behalf of an OS. Files can range from part of the core OS, to an application's internal files, to images used throughout applications or standalone HTML documents.

The majority of Windows workstations and servers rely on the NT file system (NTFS) or the older File Allocation Table (FAT) file system, whereas a large part of Unix/Linux workstations and servers use an EXT* (EXT2, EXT3, EXT4) file system.

Note

File system support is provided by the OS. Most Unix/Linux OSs -- or more precisely their kernels -- support a series of file systems, including those used in Windows OSs. Windows OSs, on the other hand, by default come with no support for the file systems used in Unix/Linux OSs, though third-party drivers are available for some file systems.

File systems vary in the amount of features they offer, which as I've already mentioned are often inconsequential to most users. However, some features are important for certain scenarios; table 3-4 illustrates a series of features and their support in various file systems.

Table 3-4 File systems and features
Feature | FAT | NTFS | EXT2 | EXT3 | EXT4 | REISER4 | JFS | XFS | Definition
File owner | - | X | X | X | X | X | X | X | Assigns an owner to a file, providing enhanced security.
Soft links | - | X | X | X | X | X | X | X | A file that contains a reference to another file or directory.
Case sensitive | - | X | X | X | X | X | X | X | File names are case-sensitive (e.g. example.txt and Example.txt are different file names in case-sensitive file systems).
Journaling support | - | - | - | X | X | X | X | X | A journal uses a log to track structural changes made to files. In the event of a system crash, a journal can verify the integrity of a file system, avoiding a lengthy file system check, saving time and preventing file corruption.
Encryption | - | X | - | - | - | - | X | - | Files or directories can be encrypted individually.
Maximum file name length | 8 characters or 255 bytes using LFN | 255 bytes | 255 bytes | 255 bytes | 255 bytes | 3976 bytes | 255 bytes | 255 bytes |
Maximum file size | 32 MiB-2 GiB-4 GiB | 16 EiB | 16 GiB-64 TiB | 16 GiB-2 TiB | 16 GiB-16 TiB | 8 TiB | 4 PiB | 8 EiB |

As illustrated in table 3-4, there are various features incorporated into a file system's design that can make it a more favorable option. However, bear in mind these are design features; another matter altogether is each file system's performance metrics, which can vary depending on the type of data (e.g. file size and type) being stored, as well as system hardware (e.g. hard drive type [e.g. SCSI] and RPM speed).

File system performance is explored in Part II of the book as it relates to web servers dispatching static files, whereas Part IV of the book goes into the finer details of file systems and their effect on permanent storage systems.

Note No file system - Raw storage

Some applications often forgo the use of file systems entirely. This means information is stored in blocks rather than files. Though this technique is often used to enhance I/O performance and scalability, it comes at the price of reduced human readability.

The previous sidebar entitled 'File-based vs. Block-based storage' elaborates on this particular scenario of using no file system.

Networked file systems

Irrespective of the local file system you use, it's also possible to reach the contents of a file system through a network. This allows an OS to read and write to a file system as if it were present in hardware physically attached to a host node, when in fact it would be on a remote node. This in turn grants an OS the potential to reach dozens or hundreds of hard drives, that would otherwise not be accessible in a single piece of hardware due to physical limitations.

In the case of web applications -- most of which reside in high-end data centers -- there is often little use for network file protocols. Instead, more sophisticated mechanisms are common for cases where file systems are not present in hardware physically attached to an OS host node. These mechanisms include Storage Area Networks (SAN) and Network Attached Storage (NAS) appliances, topics discussed earlier.

However, for the sake of completeness I will mention two of the most widely used network file protocols: for Unix/Linux file systems, the Network File System (NFS) protocol, and for Windows file systems, the Server Message Block (SMB) protocol. In order to reach file systems via the NFS or SMB protocol, it's necessary to run a corresponding server process on a host OS (e.g. NFS server or SMB server) and a client program on the accessing OS (e.g. NFS client or SMB client).

Other network file system protocols you might encounter are: WebNFS -- an extension to NFS -- for accessing a file system over the internet in a firewall-friendly way and Common Internet File System Protocol (CIFS) designed for SMB as an equivalent to WebNFS.

Parallel file systems

As explained earlier, parallelism is the act of executing simultaneous tasks leveraging more than a single resource point or node. In the case of file systems, a parallel file system consists of distributing the contents of a file system across multiple resource points or nodes. In this context, a node is a hard drive or group of hard drives. There are various reasons why parallel file systems are often favored as a storage mechanism, chief among them the added throughput and storage capacity gained by spreading a file system's contents across multiple nodes.

Since a file's contents are distributed over several nodes to allow parallelism, parallel file systems rely on meta-data to manage files. When you write a file to a parallel file system, meta-data about the file (e.g. on which nodes it's saved) is stored. This meta-data allows the file to be put back together when it's retrieved. Depending on the parallel file system, meta-data is stored in either a dedicated node or distributed across the same nodes that store files. Figure 3-5 illustrates this architecture.

Parallel file system components
Figure 3-5 - Parallel file system components

As you can see in figure 3-5, parallel file systems need a series of components to be running on the participating nodes. A client or application -- the one requiring the file -- first needs to obtain meta-data about a file, which requires the client to use a program to request it. In turn, the node on which this meta-data is located -- either a dedicated node or distributed across the same nodes that store files -- also needs to run a program to make the meta-data accessible to a client. In addition, the nodes used to store files need to run a program to be accessible by a client. Depending on the parallel file system, such components will either take the form of pre-built programs, be available through an API or even be integrated into a node's OS.
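
To make the read path in figure 3-5 more concrete, here is a minimal sketch in Python of the meta-data lookup and chunk reassembly just described; the node names, chunk ids and in-memory dictionaries are hypothetical stand-ins for the programs a real parallel file system runs on its nodes.

import collections

# Meta-data says which node holds each chunk of a file, in order
metadata = {
    "video.mpg": [("node-1", "chunk-0"), ("node-2", "chunk-1"), ("node-3", "chunk-2")],
}

# Stand-in for the storage programs running on each participating node
node_storage = {
    "node-1": {"chunk-0": b"AAAA"},
    "node-2": {"chunk-1": b"BBBB"},
    "node-3": {"chunk-2": b"CCCC"},
}

def read_file(name):
    # 1) ask the meta-data service where the chunks live, 2) fetch and reassemble them
    return b"".join(node_storage[node][chunk] for node, chunk in metadata[name])

print(read_file("video.mpg"))   # b'AAAABBBBCCCC'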

Some of the more popular parallel file systems include: Lustre, IBM General Parallel File System, Parallel Virtual File System, GlusterFS and Ibrix Fusion File System.

From a web application point of view, these are auxiliary tasks more in tune with system administration -- after all, they are still file systems. However, since some parts of a web application can be built on this architecture, it's vital to understand these building blocks.

Parallel fault tolerant file systems

Even though parallel file systems address a series of concerns beyond those of local file systems, in their basic form they lack one important characteristic: fault tolerance.

In order to support fault tolerance, any file system -- local, network, parallel or even no file system -- can and should rely on RAID. After all, losing any component of a parallel file system is just as disastrous as losing data on any other type of file system.

However, even though the 'I' in RAID stands for inexpensive, it's anything but. In the case of hardware RAID, this requires a dedicated appliance to host RAID controllers, in addition to the actual array of disks. In the case of software RAID, this requires substantial configuration effort -- often error prone -- as well as resources from a host OS. All this 'on top' of the administrative overhead of running a parallel file system, given that -- as illustrated in Figure 3-5 -- parallel file systems require communication between nodes and the underlying client to work appropriately.

To reduce the need to use RAID in parallel file systems, a series of parallel file systems have emerged that come with integrated fault-tolerant features. Which is to say, the parallel file system is itself capable of withstanding a hardware failure -- the loss of an entire node -- and continuing to run without interruption.

The way parallel fault tolerant file systems achieve fault tolerance is similar to RAID: an entire file or its parts are mirrored to multiple nodes, so that if a particular node fails, a copy is still available to continue with the underlying task associated with a file or its parts.
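
Here is a minimal sketch in Python of this mirroring idea: each chunk is written to more than one node, so the loss of an entire node still leaves a usable copy. The node names and replica count are made up; real parallel fault tolerant file systems choose replica placement far more carefully.

nodes = {"node-1": {}, "node-2": {}, "node-3": {}}

def write_chunk(chunk_id, data, replicas=2):
    targets = list(nodes)[:replicas]     # a real system picks target nodes more carefully
    for node in targets:
        nodes[node][chunk_id] = data     # the same chunk ends up on several nodes
    return targets

def read_chunk(chunk_id):
    for store in nodes.values():         # any surviving copy will do
        if chunk_id in store:
            return store[chunk_id]

write_chunk("chunk-0", b"AAAA")
del nodes["node-1"]                      # simulate an entire node failing
print(read_chunk("chunk-0"))             # b'AAAA' -- still available on another node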

Some of these parallel fault tolerant file systems include: Kosmos, the proprietary Google File System, Mogile File System and the Hadoop File System.

If you use any of these parallel fault tolerant file systems, you will often find the term commodity hardware or COTS (commercial off-the-shelf) hardware closely linked to their use. This is because it's one of the main attributes of using parallel fault tolerant file systems. Instead of requiring appliances with RAID controllers or elaborate software RAID configurations to ensure fault tolerance, parallel fault tolerant file systems can be run on inexpensive commodity hardware. Since fault tolerance is built into the parallel file system itself, node failures are absorbed by the file system.

Some of the permanent storage solutions discussed in the upcoming sections work on the back of parallel fault tolerant file systems. Applications that require large numbers of hardware nodes, fault tolerance and data processing throughput -- such as search or other data mining type processes -- have found a cost-effective solution in parallel fault tolerant file systems, since they reduce the hardware investment needed to fulfill their operation.

Note Parallel fault tolerant file systems vs. RAID

Now that you know about parallel fault tolerant file systems, I can address the similarities to RAID. Since RAID operates on techniques like mirroring, striping and parity, data is effectively present in multiple members of an array. This gives RAID the same set of features provided by parallel fault tolerant file systems: increased throughput by allowing an application to read/write data to multiple locations, in addition to fault tolerance.

However, RAID differs from parallel fault tolerant file systems in several ways. For instance, RAID per se is not capable of storing files larger than the underlying file system it backs allows (e.g. a 4GiB file on a FAT file system RAID array), whereas a parallel file system can manage large files by striping them across several nodes. RAID can also stripe data, but not with the intent of partitioning large files, as parallel file systems do.

Another difference is in cost and setup. Hardware RAID requires specialized hardware, which can make it expensive, whereas software RAID can become complex on account of being configured at the OS level. Parallel fault tolerant file systems are different because their functionality is baked in, which in turn reduces cost -- no need for specialized hardware -- and configuration -- simply install the parallel fault tolerant file system and be done.

For a second, you might be thinking of using parallel fault tolerant file systems to host your static files, web server and pretty much everything! After all, they seem to have the same characteristics as RAID without the configuration and cost overhead. Well, they actually do, but there is one little problem: an application needs to be designed to interact with a parallel fault tolerant file system.

Of the various parts that comprise a web application, it's mostly permanent storage systems that have been built to work with and take advantage of parallel fault tolerant file systems. In the upcoming sections I will discuss some of these permanent storage technologies that in fact need to run on top of a parallel fault tolerant file system to work.

It's certainly true that parallel fault tolerant file systems have eaten away at what was once RAID's sole domain. But until all parts of an application -- not just its permanent storage system -- are designed to take advantage of them, RAID will continue to be used to guarantee an application's fault tolerance and increase its throughput.

Now that you've had a refresher on file systems, I will describe a critical set of properties associated with all permanent storage systems, which will be followed by the permanent storage technologies themselves.

ACID properties for permanent storage systems

Permanent storage systems offer several guarantees for the data they store. Ever since the first storage systems appeared, it has been common to refer to such guarantees by a series of properties: Atomicity, Consistency, Isolation and Durability, often referred to by the four-letter acronym ACID. Under most circumstances, ACID properties are an afterthought for most application designers, which is actually a good thing, since permanent storage systems are responsible for enforcing ACID properties.

However, if you'll be selecting from among several permanent storage technologies, it's important you understand the characteristics of these ACID properties and how they can influence your application's data. Some permanent storage systems, to handle larger data volumes or increase data throughput, do not enforce these ACID properties in a stringent way. While this is beneficial and even a necessity for certain web applications, it could be unacceptable for other web applications.

In permanent storage systems, every data operation is associated with a transaction. A transaction is a unit of work that can include a single data operation or multiple data operations. Two terms closely associated with transactions are commits and rollbacks. A commit indicates a successful transaction, meaning all the underlying data operations were applied successfully to a permanent storage system. A rollback on the other hand indicates a failed transaction, meaning something went wrong with the underlying data operations and the permanent storage system had no changes applied to it.

Atomicity is a property that guarantees a transaction is indivisible and irreducible. Atomicity is best illustrated with 'all or nothing' behavior. Take a transaction composed of a data operation named FOO and another data operation named BAR. Atomicity is achieved by a permanent storage system applying all or none of the logic in FOO and BAR to the underlying data. This guarantees the unit of work isn't applied partially, but rather in an atomic fashion. In other words, the transaction commits only if all the logic in FOO and BAR is applied to the underlying data, or is rolled back if a problem occurs attempting to apply FOO and BAR to the underlying data.
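
As a concrete illustration of this 'all or nothing' behavior, here is a minimal sketch using Python's built-in sqlite3 module; the account table, the amounts and the deliberately misspelled table name that triggers the failure are all made up for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES (1, 100), (2, 100)")
conn.commit()

try:
    # FOO: debit account 1
    conn.execute("UPDATE account SET balance = balance - 50 WHERE id = 1")
    # BAR: credit account 2 -- the misspelled table name makes this operation fail
    conn.execute("UPDATE acount SET balance = balance + 50 WHERE id = 2")
    conn.commit()        # only reached if both FOO and BAR succeed
except sqlite3.Error:
    conn.rollback()      # neither FOO nor BAR is applied

print(conn.execute("SELECT balance FROM account").fetchall())   # [(100,), (100,)]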

You'll often hear of permanent storage systems offering atomicity levels. Atomicity enforced on a user's session guarantees a user's transactions are atomic; atomicity enforced at a system level guarantees a system's transactions are atomic; atomicity enforced on a data structure guarantees transactions made on said data structure are atomic. The 'best' atomicity depends on the type of web application and data you handle. Different atomicity levels often imply different techniques (e.g. locking or multiversion concurrency control) that can influence the other transaction ACID properties, as well as requiring more or fewer resources to enforce, which translates directly into performance trade-offs.

Consistency is a property that guarantees data integrity in a permanent storage system. Permanent storage systems can enforce several rules on the data they store. These rules can range from data types (e.g. Strings, Integers, Dates) and referential integrity rules (e.g. not being able to delete or update records without modifying another) to distributed integrity rules (e.g. committing transactions across several locations only if certain conditions are met). Permanent storage systems comply with consistency when they roll back transactions containing operations that violate any data integrity rule.

Similar to atomicity, consistency in a permanent storage system can have several levels. Some permanent storage systems avoid enforcing consistency on data structures themselves -- something known as schema-less storage systems -- to make them more flexible, while other permanent storage systems rely on techniques like 'eventual consistency' to enhance performance. Here again, there is no 'best' consistency level; it will all depend on the type of web application and data you handle, since the consistency level can also influence the other transaction ACID properties and performance metrics.

Isolation is a property that guarantees data being used in one transaction (e.g. TransactA) cannot be seen by another transaction (e.g. TransactB) until the first transaction (e.g. TransactA) has committed its results. Isolation is particularly important in applications with high concurrency, since it ensures operations are only executed on committed data and not on data that can still be in the process of being modified by non-committed transactions.

Isolation is closely tied to the atomicity property in permanent storage systems, since many of the same techniques are used to enforce it (e.g. locking or multiversion concurrency control). Similar to the other ACID properties, isolation is enforced at several levels. Whereas enforcing the highest isolation levels can lead to severe performance degradations, relying on lower levels of isolation can lead to another set of problems like 'dirty reads' or 'phantom reads'. The 'best' isolation level will again depend on the type of web application and data you handle, and whether it's susceptible to such isolation problems.

Durability is a property that guarantees data or transactions are recoverable after a system crash. For example, if a hardware or software failure occurs 1ms after a transaction commits, durability guarantees there is some way to recover the data changes made by the committed transaction, as well as any part of the underlying data structures of the permanent storage system.

Once again, you'll often find yourself confronted with different durability levels when choosing storage systems. Some permanent storage systems offer data replication, allowing them to automatically recover any part of their underlying data structures even if a sudden failure occurs. However, as beneficial as this may sound, this type of durability can influence other transactional ACID properties.

Data durability for transactions is an even more delicate issue than the data replication of data structures. For example, immediately saving all committed transactions to a storage system is inefficient, since it constantly requires accessing a storage system's file structures. To ensure better performance, the durability degree of transactions is often relaxed by keeping them in memory for a certain period of time; however, such transactions can be lost forever in the event of a failure. Other degrees of durability can include immediately saving committed transactions to a separate storage file, where they can later be recovered in case of failure or applied to the storage system without the constant performance hit, thus avoiding the possibility of losing transactions forever. Here again, there is no 'best' level of durability; it will all depend on what your web application and its data can bear and the influence particular durability techniques can have on the other ACID properties.

Now that you're familiar with ACID properties and how it's possible to relax their enforcement depending on the data a system handles, let's move on to the permanent storage systems themselves.

Embedded databases

Every file managed by a file system has a certain structure. A JPEG image file has a particular structure to support its primary image compression characteristics, an HTML file has a certain structure to be properly displayed in a web browser, a text file relies on characters and line feeds to organize its data, and so on for hundreds of file types.

As a basic function, every file serves as a permanent storage structure irrespective of type: a JPEG file stores image bits for future occasions, an HTML file stores markup for future usage and a text file stores records for posterity. However, some file types offer more sophisticated characteristics for storing information, something that's especially important when concurrently accessing a file, since it's possible for changes to take place simultaneously.

Even though a web application could use a plain text file to permanently store records like names, addresses or other such things, a text file is ill-suited in many circumstances for this purpose. Among the deficiencies of using plain text files to permanently store information are concurrent access and system crashes.

If more than one user accesses an application and they attempt to change the records in a text file, the application's design must make sure concurrent access is safe, for it can damage data integrity. Equally, if a system crash occurs during a change to a text file, the data integrity of the text file is not guaranteed. Given these circumstances, instead of relying on a plain text file -- which can complicate an application's design by requiring it to ensure safe concurrent access or risk losing data in a system crash -- applications often rely on an alternative approach which provides these features already: an embedded database.

An embedded database provides a file structure that enforces the ACID characteristics mentioned in the previous section. The term 'embedded' is in reference to how these databases accompany an application -- generally in the form of a single file -- and don't need a separate installation process or ongoing administrative maintenance like some permanent storage systems.

Embedded databases store their data in the form of key-value pairs. This provides a flexible and open-ended way of permanently storing information, since a record can store any type of data (e.g. letters, numbers) for both its key and value, without following a strict pattern.

The way to store and access key-value pairs in an embedded database is directly through an application's programming language. This access mechanism is why embedded databases are also often called libraries, since they're accessed through a programming language's Application Programming Interface(API).

The open-ended nature of storing information and access mechanism by means of an API, also allows applications using an embedded database to store data in its native format. Data representations in programming languages are stored 'as is', without requiring an extra transformation step -- either for writing or reading -- which other permanent storage systems need.
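
For instance, here is a minimal sketch using Python's built-in dbm module, which exposes a Berkeley DB-style key-value file through the language itself; the file name and keys are made up.

import dbm

# The database lives in a single local file alongside the application; key-value
# pairs are written and read back directly through the language API.
with dbm.open("players.db", "c") as db:          # 'c' creates the file if needed
    db["player:1"] = "John"                      # both key and value are free-form
    db["player:1:teams"] = "Yankees,Red Sox,Tigers"
    print(db["player:1"].decode())               # -> John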

One of the more popular embedded databases is Berkeley DB. Other embedded databases have established themselves alongside this popular option, including InnoDB and Tokyo Cabinet. You can refer to the cited websites for more information on the various programming language APIs available for each option.

Embedded databases are an ideal choice for storing and retrieving data with simple and static patterns. In addition, their key-value approach and support for ACID compliant transactions make them one of the permanent storage solutions with the highest throughput.

However, in the realm of web applications, it's rare to see embedded databases in wide use. The primary reason for this is that OLTP web applications mostly deal with complex and dynamic data patterns, which is to say web application users make data requests based on groups, dates, preferences and other variables that can become quite complex and dynamic depending on the user making them. On the other hand, OLAP web applications mostly deal with complex and massive amounts of data, something that can make embedding this volume of data in an individual file impossible. For both of these circumstances, there are more efficient permanent storage solutions.

Rare exceptions of using embedded databases in web applications can include authentication operations (i.e. user/password verification). Since authentication operations are simple -- 'Does x user match with y password ?' -- and have static patterns -- yes or no -- embedded databases are known to have high throughput in such circumstances.

However, even for such cases, it's generally considered a solution for applications ranging into 100,000+ users. For most cases, authentication operations take place on data stored on the same permanent storage systems that excel at retrieving data using complex and dynamic patterns, which have also become the norm in most OLTP web applications: Relational Database Management Systems.

Relational Database Management Systems

The choice par excellence for permanent storage in web applications has been the Relational Database Management System(RDBMS). Long before web applications -- or the Internet for that matter -- RDBMS were the technology of choice for corporations wanting to store data.

The building blocks for RDBMS are what is known as Codd's 12 rules -- actually thirteen since they're numbered from zero to twelve -- proposed by Edgar F. Codd who pioneered the relational model for databases. Most RDBMS products aid in enforcing these rules, though often not in the strictest sense.

Derived from such rules emerged the primary entity for storing data in a RDBMS: tables. Table design in a RDBMS relies on Normal Forms described in Table 3-5.

Table 3-5 Normal Forms for RDBMS tables
Normal Form | Definition
First Normal Form (1NF) | A RDBMS is in its first normal form if its tables contain no repeating groups and all columns hold scalar values.
Second Normal Form (2NF) | A RDBMS is in its second normal form when a table's redundant data is extracted and placed in a new table, creating relationships between those tables.
Third Normal Form (3NF) | A RDBMS is in its third normal form when its tables comply with 1NF and 2NF, and additionally only have columns that are fully dependent on a table's primary key.

The process of applying these Normal Forms -- called normalization -- is done with the purpose of ensuring data integrity, avoiding data redundancy and facilitating queries on the data stored in a RDBMS. Higher Normal Forms are always a desired goal, although achieving them is often dependent on earlier design experience and the complexity of the data represented in tables.

For example, listing 3-1 illustrates a RDBMS table for baseball players:

Listing 3-1 - RDBMS table with repeating groups
Player Table
ID | Name | Team1 | Team2 | Team3 | DOB | Age
1 | John | Yankees | Red Sox | Tigers | 1980 | 30
2 | Adrian | Padres | | | 1982 | 28
3 | Peter | Rangers | Marlins | | 1980 | 30

This last listing represents a table that doesn't comply with any of the RDBMS normal forms. For instance, you have repeating columns for the Team field (i.e. Team1, Team2, Team3), something that in addition to violating 1NF has the potential of wasting disk space, since not every ID will necessarily have the same number of Team values. You could make changes to the table as illustrated in listing 3-2.

Listing 3-2 - RDBMS table with non-scalar values
Player Table
ID | Name | Teams | DOB | Age
1 | John | Yankees, Red Sox, Tigers | 1980 | 30
2 | Adrian | Padres | 1982 | 28
3 | Peter | Rangers, Marlins | 1980 | 30

This approach of creating a Teams field for every player record is a slightly better approach, but the design still fails the 1NF test, since the Teams field is not a scalar value (i.e. a field that can hold only one value at a time). By not using scalar fields, it becomes inefficient to search for records having a particular value. Thus the solution becomes to break up the table, as illustrated in listings 3-3 and 3-4:

Listing 3-3 - RDBMS table 1NF and 2NF compliant
Player Table
ID | Name | DOB | Age
1 | John | 1980 | 30
2 | Adrian | 1982 | 28
3 | Peter | 1980 | 30
Listing 3-4 - RDBMS table 1NF and 2NF compliant
Team Table
Player_ID | Team
1 | Yankees
1 | Red Sox
1 | Tigers
2 | Padres
3 | Rangers
3 | Marlins

In this design, the Team field is in its own table (Listing 3-4) and associated through a relationship. Note the relationship in the new table (Listing 3-4) is made through the ID field of the player table (Listing 3-3). In RDBMS speak, the ID field is the primary key of the player table, whereas the Player_ID field of the team table is the foreign key to the player table. In addition, by breaking up the player table and creating relationships to a new table, we automatically produced a 2NF RDBMS design!
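
To see the relationship in action, here is a minimal sketch using Python's sqlite3 module that re-creates the normalized player and team tables and resolves a player's teams through the primary key/foreign key relationship; the exact DDL and data types are illustrative, and the Age field is omitted in anticipation of the 3NF step discussed next.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT, dob INTEGER);
CREATE TABLE team   (player_id INTEGER REFERENCES player(id), team TEXT);
INSERT INTO player VALUES (1,'John',1980),(2,'Adrian',1982),(3,'Peter',1980);
INSERT INTO team   VALUES (1,'Yankees'),(1,'Red Sox'),(1,'Tigers'),
                          (2,'Padres'),(3,'Rangers'),(3,'Marlins');
""")

# All teams the player named 'John' has played for, resolved via the relationship
rows = conn.execute("""
    SELECT t.team FROM player p JOIN team t ON t.player_id = p.id
    WHERE p.name = 'John'
""").fetchall()
print(rows)   # [('Yankees',), ('Red Sox',), ('Tigers',)]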

What about 3NF compliance ? 3NF specifies that a table's fields be fully dependent on a table's primary key. In the case of the player table, the Name field is fully dependent on its primary key -- since there is no other way you can get it -- but the same is not true of both the DOB and Age fields. The Age field and its corresponding values can easily be calculated from the DOB field, making it independent of the primary key. So in order for our RDBMS design to comply with 3NF, it will be necessary to remove the Age field from the player table.

For this particular data model, we're done with the normalization process. For more elaborate data models, normalization can become a complex task, particularly when applying the more advanced normal forms beyond 3NF. However, bear in mind there are no hard rules for normalizing a RDBMS data model; you can apply or omit normalization rules as shallowly or deeply as you like, though most RDBMS products practically push you toward it by design. In addition, take into consideration that an effective normalization strategy is beneficial when addressing performance and scalability concerns in a RDBMS.

A RDBMS data model or its group of tables -- just like the ones created -- is often referred to as a schema. Similar to the act of RDBMS normalization, which ensures data maintains certain characteristics, a RDBMS schema ensures data integrity.

This means you can't add or leave out data arbitrarily from a particular RDBMS schema at any given time, because it would lead to inconsistencies in the data (e.g. some player records with a DOB value, others without it; some player records with new fields, others without them). Note this is at any given time; you can define RDBMS schemas any way you please, you're just not allowed to apply certain schema rules to one data record and omit these same rules for other data records.

Another aspect of RDBMS schemas is that every table field needs to be assigned to a particular data type -- just like most programming languages assign data types. So for example, in the player table the ID field would be assigned as an integer, the Name field as a string, while the DOB field as a date. This characteristic further ensures data consistency.

Even though defining data types in RDBMS schemas is highly specific to the RDBMS vendor you choose -- something which can make cross-RDBMS portability a major issue -- the mechanisms to insert and extract data from RDBMS (a.k.a. CRUD operations) are mostly made using the same foundation: Structured Query Language(SQL).

SQL is the language of choice for most RDBMS operations. Though there are various SQL dialects based on standards like SQL-86, SQL-89, SQL-92 and SQL-2008, as well as a few SQL idiosyncrasies introduced by individual RDBMS vendors (e.g. date and time formats) on account of the data type issues just mentioned, SQL's behavior is fairly uniform across RDBMS.

SQL has four different subsets: the Data Definition Language (DDL) used to create and modify data structures, the Data Manipulation Language (DML) used to execute CRUD operations on data, the Data Control Language (DCL) used to manage access permissions, and the Transaction Control Language (TCL) used to manage transactions.

To illustrate the use of SQL, listing 3-5 presents two DDL statements.

Listing 3-5 - DDL statements for Oracle RDBMS and MySQL RDBMS
Oracle DDL:

CREATE TABLE player(
     id NUMBER(5) PRIMARY KEY,
     name VARCHAR2(30), 
     dob DATE);

MySQL DDL:

CREATE TABLE player(
     id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
     name VARCHAR(30), 
     dob DATE) TYPE=innodb;

As you can note, both DDL statements -- used to create a table named player -- are similar. The differences lie in the data types and particular properties defined by each RDBMS vendor. The first DDL statement corresponds to the Oracle RDBMS, while the second corresponds to the MySQL RDBMS.

However, even with these differences in SQL DDL statements between RDBMS vendors, SQL DML statements offer better uniformity across vendors. Listing 3-6 illustrates several DML statements usable on the player table irrespective of RDBMS vendor.

Listing 3-6 - DML statements for RDBMS
SQL DML
INSERT INTO player(id,name,dob) values(4,'David','1975-12-12') | Inserts a record into the player table - the Create in CRUD operations
SELECT * FROM player | Reads all records from the player table - the Read in CRUD operations
UPDATE player SET name='Dave' WHERE id=4 | Updates a record in the player table - the Update in CRUD operations
DELETE FROM player WHERE id=4 | Deletes a record from the player table - the Delete in CRUD operations

As you can see, DML statements have an intuitive syntax and offer the facilities to execute the CRUD operations typical of most applications. Another important characteristic of DML statements is their flexibility for manipulating data: notice how it's possible to do CRUD operations on all the records in a table, on individual records or on a group of records, and it's even possible to execute DML statements across multiple tables. This is the primary reason RDBMS are an ideal choice for storing and retrieving data with complex and dynamic patterns, since the data is stored relationally.

But for all the power SQL provides in terms of dynamic queries, it's somewhat ironic that in most web applications -- especially those relying on the scaffolding frameworks mentioned earlier -- SQL queries are rarely used directly. This phenomenon is due to how most modern programming languages used in web applications are object orientated languages. Given this fact, transferring data in the form of objects into and out of a RDBMS with SQL queries can become complicated, a problem called the Object/Relational impedance mismatch.

In order to reduce this problem, an extra software layer is often added between an application's business logic tier and the RDBMS called an Object Relational Mapper(ORM). All of the scaffolding frameworks mentioned earlier in this chapter -- as well as many other web frameworks -- rely on an ORM to simplify persisting data, albeit some use different names (e.g. Active Record).
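
To give a feel for what an ORM does, here is a minimal sketch of the idea using Python and sqlite3: a mapper function hides the SQL behind an object the application works with. The Player class and find_player function are made up for illustration; real ORMs generate this kind of SQL automatically.

import sqlite3
from dataclasses import dataclass

@dataclass
class Player:                       # the object the application works with
    id: int
    name: str
    dob: int

def find_player(conn, player_id):   # the mapper hides the SQL behind an object API
    row = conn.execute("SELECT id, name, dob FROM player WHERE id = ?",
                       (player_id,)).fetchone()
    return Player(*row) if row else None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT, dob INTEGER)")
conn.execute("INSERT INTO player VALUES (1, 'John', 1980)")
print(find_player(conn, 1))         # Player(id=1, name='John', dob=1980)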

Since ORMs are simply abstractions over SQL queries, they rely on a series of algorithms to do persistence operations which can often influence an application's performance. Given this fact, ORMs will be discussed in further detail in part IV of the book which addresses permanent storage tier performance and scalability issues.

Note NoSQL movement

In recent times there has been a strong movement in web applications called NoSQL, a moniker which alludes to the SQL used in RDBMS. The movement's purpose is particularly focused on touting the benefits of using alternative storage technologies to RDBMS -- many discussed in the upcoming sections.

Therefore, when you hear the term NoSQL, it isn't so much focused on the drawbacks of the SQL language or even popular approaches to avoid using SQL -- such as ORM -- but rather on alternatives to RDBMS technology in general.

Another fundamental topic in RDBMS is indexes. Indexes are structures generated on RDBMS tables used to speed up the execution of data queries. In simple terms, an index holds the record values for a single field belonging to a table, so if an operation is attempted on a table, the operation doesn't need to inspect all the record values of every field, but rather just the values associated with the indexed field. This can dramatically improve performance on a RDBMS -- especially for tables holding long records or thousands of records -- since it reduces the amount of data (i.e. I/O operations) that needs to be inspected to a fraction of a table's records. In addition, since indexes contain only part of a table's data, they're also more likely to fit entirely in RDBMS memory, resulting in increased throughput.
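
As a quick illustration, here is a minimal sketch with Python's sqlite3 module that creates an index on the hypothetical player table's name field; the index name and row counts are made up.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT, dob INTEGER)")
conn.executemany("INSERT INTO player VALUES (?, ?, ?)",
                 [(i, "player%d" % i, 1980 + i % 20) for i in range(10000)])

# Without an index a lookup by name scans every record; the index below lets the
# RDBMS inspect only the name values it holds.
conn.execute("CREATE INDEX idx_player_name ON player(name)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM player WHERE name = 'player42'").fetchall()
print(plan)   # the reported plan mentions a SEARCH ... USING INDEX idx_player_name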

Indexes are a deep topic in RDBMS. There are many types of indexes, indexes also have to be maintained or updated, and as helpful as they sound, a RDBMS often won't create indexes for you by default, so you have to create them explicitly. For this reason, they will be discussed in greater detail in part IV of the book. However, it's important you understand their function, since some of the upcoming permanent storage solutions either lack indexes or deal with them in different ways.

Finally, RDBMS are practically synonymous with the ACID properties mentioned earlier, mainly because ACID properties play an essential role in applications designed as OLTP, since they allow application designers to delegate most data transaction issues to the RDBMS itself, in effect simplifying application design. But even though all RDBMS products share many of the same characteristics (e.g. use tables, support SQL, define schemas, etc.), when it comes to ACID properties, not every RDBMS product is the same. This is especially true for high-end RDBMS products, which for purposes of performance and scalability offer many options when it comes to ACID property levels. The upcoming chapter on scalability techniques will address some of these options.

Among the more popular RDBMS products in the market you'll find, of the proprietary kind Oracle and DB2, of the open-source variety MySQL and Postgres, as well as more novel RDBMS products like Drizzle focused on cloud computing needs -- the last of which is actually based on MySQL.

Note Storage engines -- A RDBMS foundation

Underpinning the foundations of any RDBMS is its storage engine. Storage engines to most RDBMS users are like car engines to most drivers, something that's under the hood and just works, not to mention something that shouldn't be tinkered with unless you're knowledgeable with its internal workings. For the most part, many RDBMS products don't even allow customizations to be made to their storage engine. However, there are some RDBMS products -- particularly MySQL -- that have several configuration options.

A RDBMS always uses tables to store its data, but a storage engine can further dictate table storage characteristics like compression ratios, crash recovery facilities and caching strategies, among other things. MySQL in particular has over 10 storage engines from which to choose, which depending on the type and volume of data stored in a table can result in much better RDBMS performance. In MySQL's case, all tables use a default storage engine. However, it's possible to indicate a table's storage engine through DDL. If you look at the previous MySQL DDL example in listing 3-5, you'll notice it ends with TYPE=innodb; -- this indicates the table should use the InnoDB storage engine.

In addition, given that RDBMS storage engines are tightly knit to file system characteristics, they're often chosen and deployed alongside particular file systems, which results in better performance. So when you hear that 'a certain RDBMS runs better on X file system', it's perhaps better stated as 'a certain storage engine runs better on X file system'.

Since storage engines are central to application performance and scalability issues, I will elaborate on the subject further in section IV of the book.

Object orientated databases

This type of permanent storage system emerged in the 1980's alongside the growing number of object orientated(OO) programming languages. I won't go into a long discussion about the merits of OO programming languages, but needless to say they offer a more natural approach to structuring applications, something that can be confirmed by their established and still current adoption curve.

You would think that with OO programming languages being the basis for most web frameworks in the market, object orientated databases would be the natural choice as a permanent storage system. However, the reality couldn't be farther from this. Even to this day, you're far more likely to see an Object Relational Mapper (ORM) used to persist objects from an OO programming language into a RDBMS than to see an OO programming language persist objects into an object orientated database -- something that illustrates the dominant position of RDBMS in web applications focused on OLTP and the lack of popularity of object orientated databases in this same space.

This is not to say object orientated databases aren't used anymore. You may still find them in use, but more likely in niche application areas like computer-aided design (CAD) and multimedia. As far as web applications are concerned, it's clear their adoption rates have fizzled -- if there ever was such an adoption -- so I will skip to the next permanent storage technology on our list.

Column orientated databases

Column orientated databases evolved to address some of the deficiencies of using RDBMS to back OLAPs. As mentioned previously, web applications designed for analytical processing are characterized by working with vast amounts of data and requiring little to no creating, updating or deleting of data. In the IT world, the permanent storage solutions used for this purpose are called data warehouses, and it's often column orientated databases that are chosen for this task.

So what's wrong with RDBMS being used to execute data warehousing tasks ? There isn't anything inherently wrong; in fact, analytical processing is a possibility on RDBMS. However, it's done more out of a lack of understanding, budget constraints or often because RDBMS data being phased out becomes a data warehouse by default. But recapping from the previous section on RDBMS, some of its primary features are precisely some of the drawbacks of using it as a data warehouse: ACID properties and storing data in tables.

If you boil down data warehousing tasks, you'll realize there isn't a particular need to enforce the ACID properties common to RDBMS, for most analytical processing consists of reading and processing data, not adding or updating it.

Enforcing ACID properties on data is costly, in the resource and throughput sense. A RDBMS takes precautions in everything from file locks and caching strategies to I/O reads to make sure these ACID properties are maintained. So this RDBMS feature can essentially hinder the efficiency of tasks that don't even need it. Most column orientated storage solutions relax the ACID properties common to RDBMS to enhance their efficiency.

Another feature of RDBMS that's a drawback in a data warehouse is its table structures. Though tables and their relationships are a boon for doing complex and dynamic queries, they are inefficient structures for doing massive queries like those involving analytics. Since analytics operations mostly involve comparing and associating entire sets of data, this means a table's entire records would be read constantly. For such cases, even having finely tuned RDBMS indexes would be a moot point. RDBMS indexes just make the lookup of random records faster; reading the entire records of a table -- as most analytics processes do -- makes indexes irrelevant, since the entire records are read anyway.

Column orientated databases solve this issue of RDBMS table structures in a simple and clever way, in fact, it's where they get their 'column orientated' name. Listing 3-7 illustrates how the previous RDBMS table presented in listing 3-3 would be structured in a column orientated storage system:

Listing 3-7 - Column orientated data structure
Player Table - For column orientated storage
ID | 1 | 2 | 3
Name | John | Adrian | Peter
DOB | 1980 | 1982 | 1980

This last structure is also called a table in column orientated systems and to further make things confusing it also has rows, just like a RDBMS. So where does the 'column orientated' terminology come from ? The terminology is relative to RDBMS tables. In RDBMS tables each data record is represented as a row with multiple columns or record fields. In column orientated tables, every record's column or field is aggregated and turned into a row.

At first sight the differences between this 'column orientated' data layout in listing 3-7 and the 'row orientated' data layout used in RDBMS in listing 3-3 may seem inconsequential. However, it can have a major impact on data operations made over entire sets of records. Once again we need to fall back on the nature of data warehousing operations to understand this.

For example, some analytical processing related to this column orientated table could include things like 'How many players are named "John" ?' or 'How many players were born before 1980 ?'. If you were to do such queries on a 'row orientated' RDBMS table containing over 1 million records, the RDBMS would need to scan and read 1,000,000 rows, representing an enormous amount of I/O operations on a storage system's hard drives. On the other hand, doing such queries on a 'column orientated' storage system containing over 1 million records, each query would consist of scanning 1 row, since all values are lumped together by column (i.e. Name, DOB). In essence, the 'column' would itself be the index for the entire set of data, with no need to scan multiple rows to find the values.

If you do quick math you might be skeptical of these differences, because even though you may read 1 row per query, you're now dealing with enormous rows (e.g. 1 row containing 1,000,000 Name values, 1 row containing 1,000,000 DOB values). But I recommend you do detailed math. If you were to attempt similar queries on a RDBMS table, you would need to read 1,000,000 rows containing 1 Name value each, but in doing so, each of these rows would also contain 1 DOB value, 1 ID and possibly even more and larger fields like a LastName or Description, with each field contributing more I/O operations unnecessary to fulfilling the intended analytics operation. Thus 'column orientated' tables can offer superb throughput levels for operations involving large sets of data -- relative to RDBMS -- due to the way they lump data together.
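
Here is a toy in-memory illustration in Python of how much data each layout touches for the same question; real column orientated systems of course operate on disk blocks, but the access pattern is analogous, and the records are the made-up player data from the earlier listings.

rows = [                                   # row orientated: one record per row
    {"id": 1, "name": "John",   "dob": 1980},
    {"id": 2, "name": "Adrian", "dob": 1982},
    {"id": 3, "name": "Peter",  "dob": 1980},
]
columns = {                                # column orientated: one 'row' per field
    "id":   [1, 2, 3],
    "name": ["John", "Adrian", "Peter"],
    "dob":  [1980, 1982, 1980],
}

# Row layout: every record is visited, dragging along id and name values too
print(sum(1 for r in rows if r["dob"] < 1982))          # -> 2

# Column layout: only the dob 'row' is touched, nothing else is read
print(sum(1 for dob in columns["dob"] if dob < 1982))   # -> 2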

Of course this same characteristic can make column orientated storage equal or worse in performance than a RDBMS. For example, if you inserted data -- a typical OLTP operation -- into a column orientated data structure, this would require altering several 'column' values given the data layout. The same performance problem would occur if you attempted to update values, since it would become necessary to alter several 'columns' to do so. For this reason, column orientated storage systems are better suited to high-volume read operations, typical of data warehousing or OLAP systems.

Among the leading column orientated storage systems you'll find, of the proprietary kind Sybase IQ and Exasol, of the open-source variety InfiniDB and Infobright.

Multidimensional databases

Another option used in OLAPs are multidimensional databases, often referred to by the short name cubes. Multidimensional databases -- just like column orientated databases -- emerged to address the shortcomings of using RDBMS for analytical processes.

Multidimensional databases also structure data in tables. But just like column orientated database tables, they differ from RDBMS tables. Multidimensional databases consist of fact and dimension tables. A fact table aggregates a series of metrics related to an analytical process, whereas dimension tables are used to give meaning to the metrics of a fact table. Since fact tables have associations with multiple dimension tables, a multidimensional database is often said to have a star schema -- with a fact table at the center and the dimension tables associated with it around the center. Listing 3-8 illustrates a fact table with several dimension tables used for analytics on baseball players.

Listing 3-8 - Star schema for multidimensional database
Fact table
  At bats
  Strikeouts
  Runs batted in
Dimension table - Time
  Date
  Month
  Year
Dimension table - Player
  ID
  Name
  DOB
Dimension table - Team
  Player_ID
  Name
Dimension table - At bat
  ID
  Strikeout
  RBI

The fact table in this case is used to aggregate three types of values: at bats, runs batted in and strikeouts. The meaning of these aggregated values is taken from the various dimension tables associated with the fact table. This association between a fact table and its dimension tables leads to data structures resembling cubes -- the reason why multidimensional databases are often called cubes. Figure 3-6 illustrates the cube structures that can emerge from the previous star schema.

Multidimensional database cube structures
Figure 3-6 - Multidimensional database cube structures

As you can see, each axis of a cube represents a dimension table's fields, with each intersecting cell containing an aggregate value for the corresponding dimensions. The number in each cell by itself doesn't mean anything, but because each cell intersects with several dimensions, each number has a meaning. For example, analytical queries like 'How many runs batted in did the player named 'John' have in 2008 ?' or 'How many strikeouts did the entire Yankees team have in 2009 ?', can easily be obtained since they're pre-calculated.

By having these aggregate values pre-calculated, you gain a great deal of efficiency in analytical processes compared to RDBMS schemas, in which you would have to aggregate records in an on-demand fashion, since pre-aggregating values would actually go against RDBMS normal forms. In addition, these pre-aggregated values -- generally stored as arrays -- provide significant space savings over tables, and cell lookups become arithmetic operations rather than indexed searches.
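
As a toy illustration, here is a pre-aggregated 'cube' kept in a plain Python dictionary, where each cell is addressed by its coordinates along the player, team and year dimensions; the names and numbers are made up, and real multidimensional databases store the cells far more compactly.

# Each cell holds a pre-calculated aggregate (runs batted in) for its coordinates
cube = {
    ("John", "Yankees", 2008): 35,
    ("John", "Yankees", 2009): 41,
    ("Peter", "Rangers", 2008): 22,
}

# 'How many runs batted in did John have with the Yankees in 2008?' is a direct
# cell lookup, not an on-demand aggregation over detail records.
print(cube[("John", "Yankees", 2008)])   # -> 35

# Rolling up along the year dimension just walks the matching cells
print(sum(v for (player, team, year), v in cube.items()
          if player == "John" and team == "Yankees"))   # -> 76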

Among the leading multidimensional database storage systems you'll find Oracle Essbase and SQL Server Analysis Services.

Note Extract-Transform-Load (ETL) tools

Now that I've described a series of permanent storage solutions, you might have noticed it's a common situation for one storage system to hold the same data as another (e.g. RDBMS tables and column orientated tables), the only difference being that one type of storage has better performance and scalability metrics than the other for particular data access patterns. This is why it's not strange for some parts of an application to interface with a RDBMS, yet other parts of an application to interface with the same data in a different storage system.

This situation has created a large software tools market dubbed ETL -- for Extract, Transform and Load. Therefore, when considering the use of a new permanent storage system for pre-existing data already structured in another permanent storage system, I recommend you look over ETL tools. Chances are they will make integrating pre-existing data into the new permanent storage technology much easier.

Document orientated databases

Another type of storage system are document orientated databases, which take a more pragmatic approach to storing data. Since documents offer a user-friendly format to store data -- compared to RDBMS tables, column tables or multidimensional tables -- they are often the preferred choice when end-users will constantly interact with their own data and there are no particular business rules to enforce on data.

Another main aim -- besides flexibility -- of the first document orientated databases was to centralize and share documents. This allowed users in an organization to work collaboratively on the same document and not have documents isolated on their own workstations. This last characteristic is why you'll also find some of these 1st generation document orientated databases often classified as Groupware, solutions which in fact pre-date the web era. One of the first products in this space was IBM's Lotus Notes.

Paving the way for what I would call 2nd generation document orientated databases was the XML markup language. When interoperability between web applications became a concern, XML became a popular means to exchange data in a programming language agnostic way. With XML becoming a prevalent choice for exchanging data in web applications, the next obvious step became to store data in XML. In this case, the main purpose behind this second generation wasn't so much end-user friendliness, as it was ease of interoperability and data access -- machine-to-machine friendliness if you will.

That said, let's analyze the structure of documents stored on document orientated databases relying on XML. Note that in addition to XML, many products in this space also support similar document formats like CSV or JSON. Listing 3-9 has several representations of data on a document orientated database taken from the baseball data model used to illustrate the previous permanent storage systems' characteristics.

Listing 3-9 - Document orientated database representation in XML, CSV and JSON
Document representation in XML
<?xml version="1.0">
<players> 
 <player> 
  <id>1</id>
  <name>John</name> 
  <dob>1980</dob>
  <teams> 
   <team>Yankees</team>
   <team>Red Sox</team> 
   <team>Tigers</team> 
  </teams> 
 </player> 
 <player> 
  <id>2</id> 
  <name>Adrian</name> 
  <dob>1982</dob>
  <teams>
   <team>Padres</team>
  </teams>
 </player>
 <player>
  <id>3</id> 
  <name>Peter</name> 
  <dob>1980</dob> 
  <teams> 
   <team>Rangers</team> 
   <team>Marlins</team>
  </teams>
 </player>
</players>
ID="1",Name="John",
DOB=1980,Teams="Yankees,Red Sox,Tigers"

ID="2",Name="Adrian",
DOB=1982,Teams="Padres"

ID="3",Name="Peter",
DOB=1980,Teams="Rangers,Marlins"
{"players":
 [
  {"id":1,
   "name":"John",
   "dob":1980,
   "teams":["Yankees","Red Sox","Tigers"]
  },
  {"id":2,
   "name":"Adrian",
   "dob":1982,
   "teams":["Padres"]
  },
  {"id":3,
   "name":"Peter",
   "dob":1980,
   "teams":["Rangers","Marlins"]
  }
 ]
}

The primary difference between storing data in documents as illustrated in this last listing, as opposed to tables, is that the data is semi-structured. If you recall, all the previous permanent storage systems require that data be placed in tables in a particular way (i.e. according to a schema), whereas document orientated storage systems take a 'hands-off' approach in this sense. Among the primary characteristics of documents stored on a document orientated storage system you'll find the following: fields are embedded alongside their values, making each document self-describing; documents in the same store don't need to share the same fields; and fields can be added to or left out of individual documents without redefining any overarching structure.

These properties are characteristic of what is often cataloged as a schema-less storage system -- a system with a relaxed consistency property, in the ACID storage system properties sense. The following sidebar has more details on these schema-less storage systems.

Note Schema-less and Schema based storage systems

One characteristic you'll often see cited in permanent storage systems is being schema-less. A schema-less storage system is one that doesn't need data records to follow a certain structure. For example, in a schema-less storage system you can store Player records, each with different field types and values. Schema-based storage systems on the other hand require all records to adhere to a certain structure (i.e. their schema). For example, in a schema-based storage system if you start storing Player records with 5 fields, you can't arbitrarily store a Player record with 7 fields, unless you re-define its schema.

Embedded databases and document orientated databases are generally of the schema-less type, as well as certain distributed databases. RDBMS on the other hand set the norm in terms of schema-based storage systems.

There isn't anything in particular that makes using schema-less storage systems better or worse than using schema-based storage systems. It just depends on the nature of the data you're trying to store. Sometimes it's counter-productive to enforce a certain structure (i.e. schema), while on other occasions it's critical to enforce a structure to fulfill the purpose of an application.

This schema-less characteristic alone makes document orientated databases substantially different from RDBMS. But there are other features that make document orientated storage systems vastly different from RDBMS. One in particular is related to their data access operations. Unlike RDBMS that revolve around the use of SQL for data access operations, document orientated systems are more fragmented in their data access patterns.

For example, some products operating on the basis of XML documents support data access operations through XML-based languages like XQuery and XPath, which are standards for navigating and manipulating XML. Other solutions may offer language-specific APIs to execute data access operations against documents (e.g. createRecord("John Smith"), updateRecord(133,"Tom Jones"), etc.). And yet others can even support executing data access operations directly on the web, via REST (e.g. http://127.0.0.1/777/create/record/John_Smith, where 777 would represent a document id and the remaining URL parameters CRUD operators and input data). Some of these data access approaches -- especially the one through REST -- can provide for more versatile application architectures, compared to using other permanent storage systems.
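
As a toy illustration of querying documents directly from an application, here is a sketch in Python that loads the JSON document from listing 3-9 and filters it in code; it stands in for the XQuery, API or REST mechanisms just mentioned rather than reproducing any particular product's interface.

import json

doc = json.loads("""
{"players": [
  {"id": 1, "name": "John",   "dob": 1980, "teams": ["Yankees", "Red Sox", "Tigers"]},
  {"id": 2, "name": "Adrian", "dob": 1982, "teams": ["Padres"]},
  {"id": 3, "name": "Peter",  "dob": 1980, "teams": ["Rangers", "Marlins"]}
]}
""")

# 'Which players have played for more than one team?' -- no schema or tables needed
print([p["name"] for p in doc["players"] if len(p["teams"]) > 1])   # ['John', 'Peter']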

Among the more popular document orientated storage solutions, you'll find Qizx, Sedna and MarkLogic.

For a moment you might think I left out the newer document orientated databases like MongoDB and CouchDB. I didn't; even though these are newer document orientated databases -- that I would in fact classify as 3rd generation document orientated databases -- they have a distinctive characteristic in addition to the characteristics I've already mentioned: they're distributed document orientated databases.

Distributed document orientated databases

Most of the talk you'll hear today about document orientated databases is in great part related to distributed document orientated solutions. Though similar to their other document orientated counterparts in storing data in documents, they differ in being strongly designed on distributed computing and web principles.

On the matter of distributed computing principles, this includes features like parallel data access, fault-tolerance and the built-in ability to grow onto multiple nodes in the case of large data sets. On the matter of web principles, this includes storing documents in web-centric formats (e.g. XML and JSON) and allowing access to documents directly through the web -- via REST.

The reason the designers of these storage systems focus on a distributed architecture is that it alleviates one of the primary performance and scalability bottlenecks of storage systems: having a single point where data is stored and read. By leveraging multiple locations on which to read and write data, a storage system gains parallelism, the ability to host large data sets spread over multiple nodes and fault-tolerance, all of which in turn increase the performance and scalability of a system.

For this reason, replication and synchronization are centerpieces to the design of distributed document orientated databases. For now having a basic understanding of these concepts -- explained in Chapter 1 -- is enough. The upcoming chapter on scalability techniques addresses these topics and their deeper implications on an application's permanent storage system.

Note

RDBMS also incorporate distributed computing techniques like replication and synchronization to deal with performance and scalability situations. However, these techniques are mostly reserved for high-end RDBMS versions (a.k.a. RDBMS clusters). The upcoming chapter on scalability techniques will address this in greater detail, but for now bear with me as I continue on the generalities of distributed document orientated systems.

Introducing replication and synchronization into the design of a document based storage system means increased throughput and fault-tolerance, on account of data being available at several locations. But by adopting this design, you also bring with it added complexities in the area of transactions and the ACID properties mentioned at the start of this section.

The complexities in this case are best explained with the following scenarios. If simultaneous operations are done on two copies of document "X" -- remember there will be multiple copies at any given time by design -- which of these two copies will hold the definitive data ? In case a subsequent operation is made on document "X", determining this is vital to guaranteeing data consistency. An even simpler case could involve a copy of document "X" suddenly disappearing (e.g. due to a hardware failure), will the remaining copy be up to date ? If so, up to what data transaction ? If not, how do you recover the latest data transactions ?

In order to successfully solve these last scenarios, distributed document orientated databases relax their consistency level -- the C in the ACID properties. In doing so, these permanent storage systems -- like any distributed system for that matter -- rely on a variety of techniques. For the moment it's not important that you know these techniques, as they will be the subject of the upcoming chapter. What is important to understand is that even though you lose on the consistency front, you gain on the performance and scalability front.

Note CAP theorem - Pick two out of three

The CAP theorem, postulated almost 10 years ago by Inktomi's Eric Brewer, states that of three highly desirable properties of distributed systems -- consistency (C), availability (A) and partition-tolerance (P) -- you can only choose two at any one time.

For example, if you want a storage system to have high-levels of consistency and availability, it won't be able to work with data spread out over multiple nodes (i.e.partition-tolerance). If on the other hand, you want a storage system to work with data spread out over multiple nodes (i.e.partition-tolerance) it will either have high-levels of consistency or availability, but not both -- high consistency over multiple nodes requires time which hurts availability; high availability over multiple nodes requires little to no changes which hurts consistency.

Give the CAP theorem time to settle into your thought process, since it's essential to understanding the various permanent storage systems based on distributed computing.

In addition to a strong focus on replication and synchronization, distributed document orientated databases also contemplate sharding in their design. Sharding is a technique specific to storage systems, where large data sets -- documents in this case -- are accommodated across multiple nodes without manual intervention. The upcoming chapter on horizontal scaling goes into greater detail on the topic of sharding.
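
As a rough illustration of sharding, the following Python sketch assigns documents to nodes by hashing their key. The node names are hypothetical, and production systems use more elaborate schemes that can also rebalance data automatically.

import hashlib

NODES = ["node-1", "node-2", "node-3"]  # hypothetical storage nodes

def shard_for(document_key):
    # Hash the document's key and map it to one of the available nodes, so
    # a large data set spreads over multiple nodes without manual intervention.
    digest = int(hashlib.md5(document_key.encode("utf-8")).hexdigest(), 16)
    return NODES[digest % len(NODES)]

print(shard_for("player:adrian"))  # e.g. 'node-2'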

Finally, it's only natural that, given their distributed nature, distributed document orientated databases also support parallel data access mechanisms. So in addition to regular document orientated database query mechanisms like XPath or APIs, distributed document orientated databases also support data access mechanisms like MapReduce.
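
To give a feel for the MapReduce style of access, here's a minimal, single-process Python sketch that counts how many documents mention each team; an actual distributed database would run the map and reduce phases in parallel across its nodes.

from collections import defaultdict

documents = [
    {"player": "Adrian", "team": "Padres"},
    {"player": "John", "team": "Yankees"},
    {"player": "Peter", "team": "Padres"},
    {"player": "John", "team": "RedSox"},
]

def map_phase(document):
    # Emit a (key, 1) pair for every document -- here keyed by team.
    yield document["team"], 1

def reduce_phase(key, values):
    # Aggregate every value emitted for the same key.
    return key, sum(values)

grouped = defaultdict(list)
for document in documents:
    for key, value in map_phase(document):
        grouped[key].append(value)

print([reduce_phase(key, values) for key, values in grouped.items()])
# [('Padres', 2), ('Yankees', 1), ('RedSox', 1)]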

Among the more popular distributed document orientated database systems you'll find MongoDB, CouchDB and Amazon SimpleDB -- the last of which is a cloud computing service.

Distributed key-value databases

Distributed key-value databases are another alternative that has recently caught the attention of web application designers. You may recall that I've already described one storage system that operates on the principle of key-values: embedded databases.

Storing data in the form of key-value pairs provides a flexible and open-ended way of permanently storing information, since records can consist of any type of data (e.g. letters, numbers) for both the key and the value, without following a strict pattern. This differs from document orientated databases, which are based on the more elaborate storage structure of documents, as well as from relational, column and multidimensional storage systems, which are based on the stricter storage structure of tables. Thus key-value pairs are by far the simplest and most flexible means to store data.

In fact, if you look closely at the widely used data structures in programming languages called hash tables, hash maps or associative arrays -- the name varies depending on the programming language -- you'll realize this approach of key-value pairs is a natural form to manipulate data. Obviously, the embedded database pioneers identified this pattern too.
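
In Python terms -- used here purely as an illustration -- a key-value store behaves much like the language's built-in dictionary: any key can map to any value, with no schema at all.

# Keys and values can be any mix of data, without following a strict pattern.
store = {}
store["user:1001:name"] = "Adrian"
store["user:1001:teams"] = ["Padres"]
store["session:af83bc"] = {"logged_in": True, "hits": 42}
print(store["user:1001:name"])  # 'Adrian'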

So why the need for a distributed key-value storage system? Why can't embedded databases be used instead? Even though embedded databases can grow to store TiBs of data, there are two limiting factors to using these early key-value databases in web applications. The first is their use of a single storage location, which as reiterated before, creates a contention and failure point that goes against the very principles of performance and scalability. The second is that embedded databases are not designed to be directly accessible on the web.

By distributing key-value records across several nodes, you solve the single contention and failure point, and also gain parallel data access -- which increases throughput -- and integrated fault-tolerance -- given that key-value pairs are available on multiple nodes. However, similar to their distributed document orientated counterparts, in doing so these storage systems introduce a set of drawbacks, not present in regular key-value databases, related to the ACID properties they normally enforce.

Distributed key-value databases by design rely heavily on the replication and synchronization of data, something that forces them to approach the consistency property of storage systems -- the C in the ACID properties -- in a different way than regular storage systems. For now it's not important to understand these underlying approaches, just that they have implications for the data transactions made by your web applications. The following chapter on scalability techniques addresses these approaches in detail.

Moreover, some distributed key-value databases push the boundaries of the ACID properties further by relaxing the durability property of storage systems -- the D in the ACID properties. They do so in an attempt to further increase the performance and scalability metrics of a storage system. By leaving data in-memory for a lengthier amount of time before making it durable (i.e. writing it to disk), they limit the overhead of accessing hard drives. This comes at the cost of reduced durability, in which case a sudden failure can mean some data transactions are lost. But even with this possibility, it's a workable solution for certain applications with very demanding performance requirements.
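
The following Python sketch illustrates the relaxed-durability idea: writes are held in memory and only flushed to disk once a threshold is reached, so a crash in between loses the buffered transactions. The class name, file path and threshold are all hypothetical.

import json

class RelaxedDurabilityStore:
    def __init__(self, path, flush_every=100):
        self.path = path                # where data eventually becomes durable
        self.flush_every = flush_every  # hypothetical flush threshold
        self.buffer = {}                # in-memory only, lost on a sudden failure

    def put(self, key, value):
        # Fast: the write only touches memory until the threshold is reached.
        self.buffer[key] = value
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        # Making the buffered data durable (i.e. writing it to disk).
        with open(self.path, "a") as f:
            f.write(json.dumps(self.buffer) + "\n")
        self.buffer.clear()

store = RelaxedDurabilityStore("/tmp/kv.log", flush_every=2)
store.put("a", 1)
store.put("b", 2)  # reaches the threshold, so 'a' and 'b' are flushed to disk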

Among the more popular solutions to emerge in this space are Membase, Redis and Project Voldemort.

Distributed column-row orientated databases

Distributed column-row databases share the same primary trait as their other distributed counterparts: a focus on increased scalability and performance at the cost of relaxing the standard ACID properties common to most storage systems.

Before even discussing the technical architecture of these storage systems, I will discuss the specific business needs that brought about their existence. This will give added context to our discussion, in addition to illustrating under what circumstances this technology is better applied.

Envision a storage system requiring the characteristics of a column orientated database -- queries involving aggregate data -- yet still providing the ability to insert new data while enforcing ACID properties, all within reasonable time frames. Vanilla column orientated databases of course allow data insertions; the problem lies in them not being efficient for this purpose. A simple solution would consist of adding more resources (e.g. memory & CPUs) to a vanilla column orientated database until execution times dropped below certain thresholds.

But in addition to providing timely results with these last characteristics, assume this storage system also has to have the capacity to process TiBs (tebibytes) of data on a daily basis. So much for the simple solution. Why? Because simply guaranteeing the up-time of a system processing this amount of data is a monumental task, never mind the required resources and potential licensing costs. This scenario is what applications run by organizations like Google and Facebook face on a daily basis, which is why distributed column-row orientated databases were popularized from within these organizations.

For example, in order for Google to find the most relevant search results, a commonly used metric is the number of links leading to a particular page. Extrapolate this to the millions of searches made on Google on a daily basis, where each search is made up of hundreds of pages and each page's weight can depend on dozens or hundreds of links. This process requires aggregation, as well as formal data structures, not to mention execution times in seconds -- requirements that go beyond what the distributed document and key-value storage systems just described are designed for.

Switching gears toward the technical side, these storage systems are called 'column-row' because they store data using some of the same principles as column and row orientated storage systems. Though technically they're often just referred to as 'distributed column orientated', the storage systems in this category can effectively provide hybrid functionality, hence the 'column-row' naming.

As already mentioned, Google pioneered much of the technology related to distributed column-row storage systems. Nevertheless, considering that Google does not offer this particular technology to the public at large -- but rather uses it internally to build most of its services -- several alternatives based on these same principles have emerged. However, as technologically similar as these storage systems are, each one uses different naming conventions. Table 3-6 illustrates the naming conventions used for several key components of these distributed column-row databases.

Table 3-6 Distributed column-row implementations and terms
Implementation name | Distributed file system | File type storage | Table partitions | Servers | Locking service
Google BigTable | Google File System | SSTable files | Tablets | Master server / Tablet server | Chubby
Hadoop HBase | Hadoop File System | MapFile files (made up of an HStore) | Row ranges/regions | HBaseMaster / Region server | N/A
Hypertable | Hadoop File System & others | CellStore files | Row ranges | Master server & DFS broker / Range server | Hyperspace

I will start by addressing the primary building block of these storage systems: a parallel fault-tolerant file system. Unlike many of the earlier permanent storage solutions -- even the distributed document and key-value kind -- distributed column-row databases are built on top of parallel fault-tolerant file systems.

This is an extremely important characteristic, since it means part of the core functionality of these storage systems is provided by the file system itself. Some permanent storage solutions in this space support a series of parallel fault-tolerant file systems, whereas others support or highly recommend a single option. Once you grasp the purpose and architecture of a parallel fault-tolerant file system -- which is important in itself and is described in a previous section -- we can move on to the other components of these storage systems, including how they structure and manage their data.

Note

It's possible to run distributed column-row databases in a pseudo-distributed or standalone fashion. These options allow you to run the storage system on a single node. Though this can be useful for testing or exploration purposes, doing so for a real application is beside the point. Running these storage systems on top of a parallel fault-tolerant file system is the only way to achieve the parallelism and fault-tolerance needed to increase performance and scalability.

Distributed column databases also rely on tables, columns and rows to structure their data; in this sense they aren't much different from some of the previous storage solutions. However, they do have their own particular way of structuring tables, which resembles a hybrid between the previous column and row storage models.

A table is constituted by a row key, column families and column qualifiers. Listing 3-10 illustrates this table structure.

Listing 3-10 - Distributed column-row storage system table structure for baseball statistics
Row key: Adrian
  Column family 'Team':
    Padres (2010-10-25)
  Column family 'At bats', qualifier 'Runs batted in':
    1 (2010-10-20)
    3 (2010-10-17)
    3 (2010-10-12)

Row key: John
  Column family 'Team':
    Yankees (2010-10-20)
    RedSox (2010-10-15)
    Tigers (2010-10-10)
  Column family 'At bats', qualifier 'Runs batted in':
    2 (2010-10-25)
    1 (2010-10-20)
  Column family 'At bats', qualifier 'Strikeouts':
    1 (2010-10-25)

Row key: Peter
  Column family 'Team':
    Rangers (2010-11-28)
    Marlins (2010-11-24)
  Column family 'At bats', qualifier 'Runs batted in':
    4 (2010-10-21)
  Column family 'At bats', qualifier 'Strikeouts':
    1 (2010-10-25)
    3 (2010-10-22)
    2 (2010-10-17)

Listing 3-11 illustrates another row key, column family and column qualifier table structure, but this time using web site data similar to how Google would use it.

Listing 3-11 - Distributed column-row storage system table structure for web statistics
Row key: com.mashupsoft.www
  Column family 'Page':
    Home (2010-10-20)
    About (2010-10-15)
    Products (2010-10-10)
  Column family 'Anchor', qualifier 'com.google.www':
    2 (2010-10-25)
    1 (2010-10-20)
  Column family 'Anchor', qualifier 'com.webforefront.com':
    1 (2010-10-25)

Row key: com.remarkbox.www
  Column family 'Page':
    Home (2010-10-25)
    FAQ (2010-10-25)
  Column family 'Anchor', qualifier 'com.google.code':
    1 (2010-10-20)
    3 (2010-10-17)
    3 (2010-10-12)

Row key: com.webforefront.www
  Column family 'Page':
    Home (2010-11-28)
    Contact (2010-11-24)
  Column family 'Anchor', qualifier 'com.cnet.www':
    4 (2010-10-21)
  Column family 'Anchor', qualifier 'com.dzone.www':
    1 (2010-10-25)
    3 (2010-10-22)
    2 (2010-10-17)

As you can see, this type of table structure lends itself to storing large amounts of data. The row key serves as a type of primary key, similar to the one used in RDBMS tables. But unlike the typical fields associated with RDBMS tables -- see listing 3-4 for a refresher -- each key is associated with column families.

Column families aggregate values that are closely related to one another -- a strikingly similar process to how column and multidimensional orientated tables aggregate values. In fact, data access on these storage systems is made by column families, which is where the atomicity of data transactions is enforced.

Another factor to take into consideration is that row keys can be defined with values for all or only certain column families. This is why this type of storage system is said to hold sparse data. This presents a flexible approach to data integrity rules, not as open-ended as document or key-value storage systems -- since column families enforce structure -- yet not as rigid as relational databases, which rely on individual data types and other integrity rules.
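
A rough way to picture this structure -- using nested Python dictionaries purely as an illustration -- is a map of row keys to 'family:qualifier' columns to timestamped values. Note how each row only carries the columns it actually has values for, which is the sparse quality just described.

# Sparse, column-family structured data modeled with plain dictionaries.
table = {
    "Adrian": {
        "Team": {"2010-10-25": "Padres"},
        "At bats:Runs batted in": {"2010-10-20": 1, "2010-10-17": 3},
    },
    "John": {
        "Team": {"2010-10-20": "Yankees", "2010-10-15": "RedSox"},
        "At bats:Strikeouts": {"2010-10-25": 1},
    },
}
# Access happens by row key and column family, e.g.:
print(table["John"]["At bats:Strikeouts"])  # {'2010-10-25': 1}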

You might have noticed that each record displayed in the previous listing has a timestamp associated with it. This is another particular characteristic of how distributed column-row storage systems store data. This timestamp is added as a book-keeping procedure by the storage system and has nothing to do with dates you would normally add to your data. The timestamp plays an important role in how data is stored and accessed, so let's get to that next.

Listing 3-10 illustrates a table's logical structure, but what's even more interesting is the way distributed column-row databases store their data at the physical level (i.e. in files). Distributed column-row databases physically store their data on the basis of column families and their corresponding row key. Listing 3-12 illustrates this physical layout.

Listing 3-12 - Key-value physical layout structure for a column-row storage system table
Key | Value
Adrian Team 2010-10-25 | Padres
Adrian At bats:Runs batted in 2010-10-20 | 1
Adrian At bats:Runs batted in 2010-10-17 | 3
Adrian At bats:Runs batted in 2010-10-12 | 3
John Team 2010-10-20 | Yankees
John Team 2010-10-15 | RedSox
John Team 2010-10-10 | Tigers
John At bats:Runs batted in 2010-10-25 | 2
John At bats:Runs batted in 2010-10-20 | 1
John At bats:Strikeouts 2010-10-25 | 1
Peter Team 2010-11-28 | Rangers
Peter Team 2010-11-24 | Marlins
Peter At bats:Runs batted in 2010-10-21 | 4
Peter At bats:Strikeouts 2010-10-25 | 1
Peter At bats:Strikeouts 2010-10-22 | 3
Peter At bats:Strikeouts 2010-10-17 | 2

Note that key-value entries are stored sorted by key, with timestamp values in descending order. This makes all values corresponding to a particular key sit contiguously -- remember a column orientated storage system's approach? -- so when one record is accessed, all similar records are adjacent, making access very efficient. The timestamp ordering, meanwhile, provides access to the most recent record first. This physical data layout lends itself to analytical processes.
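
Continuing the illustrative Python sketch from earlier, flattening such a logical structure into entries sorted by key -- with timestamps newest first -- yields a layout equivalent to Listing 3-12; the sorting logic here is only a sketch of the idea.

# Flatten a logical row/column-family structure into sorted key-value entries.
table = {
    "Adrian": {
        "Team": {"2010-10-25": "Padres"},
        "At bats:Runs batted in": {"2010-10-20": 1, "2010-10-17": 3},
    },
}

entries = []
for row_key, families in table.items():
    for column, versions in families.items():
        for timestamp, value in versions.items():
            entries.append((row_key, column, timestamp, value))

def sort_key(entry):
    row_key, column, timestamp, _ = entry
    # Row key and column ascending, timestamp descending (most recent first).
    return (row_key, column, tuple(-int(part) for part in timestamp.split("-")))

for row_key, column, timestamp, value in sorted(entries, key=sort_key):
    print(row_key, column, timestamp, "->", value)
# Adrian At bats:Runs batted in 2010-10-20 -> 1
# Adrian At bats:Runs batted in 2010-10-17 -> 3
# Adrian Team 2010-10-25 -> Padres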

A physical data structure like the one presented in Listing 3-12 is archived by the storage system in special files -- SSTable, MapFile or CellStore files; see Table 3-6 for details on these name variations. Since these files are managed by the underlying parallel fault-tolerant file system, they have a particular set of behaviors.

Once each of these files holds a sufficient amount of table data -- 200 MiB or more depending on implementation or configuration -- the table data is split into two files. Each table part emerging from this process is called a region, tablet or row range.

It's the duty of a storage system's tablet servers -- or region servers, depending on the implementation -- to manage the actual data files, including all reading and writing operations. This leaves a storage system's master server to coordinate all tablet servers, including keeping track of all tablet locations and the health of each tablet server.

When access is required to a storage system's table, communication is first established with the system's master server, which knows on which tablet server a table's data is located. Subsequent communication -- for reading or writing data -- is then established directly with the tablet server holding the table's data.

When a write operation is done on this type of storage system, a tablet server initially commits the operation to a log file -- called a commit log or HLog, depending on the implementation -- which ensures durability, the D in the ACID properties. In addition, the information is kept in-memory. If a subsequent read operation is made on the same tablet server, the tablet server first checks whether the operation can be fulfilled from in-memory data to increase performance; if it can't, the tablet server then fulfills the read operation by opening an actual file.

Another important characteristic of the files managed by tablet servers is that they're immutable. This is yet another vital difference compared to other data storage systems. By being immutable, data files eliminate expensive I/O re-write operations: once a file is closed, its data can't be modified or deleted. This is why write operations on these storage systems are relatively fast -- there is no need to do seeks or reads to make a write operation; writes are just appended one after another.
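
Putting the last few paragraphs together, a highly simplified Python sketch of this write path -- commit log first, then memory, then an immutable file once enough data accumulates -- might look like the following; all class names, file paths and thresholds are hypothetical.

import json

class TabletServerSketch:
    def __init__(self, commit_log="/tmp/commit.log", flush_at=3):
        self.commit_log = commit_log
        self.flush_at = flush_at
        self.memtable = {}       # recent writes kept in memory
        self.data_files = []     # immutable files, never rewritten

    def write(self, key, value):
        # 1. Append the operation to the commit log, ensuring durability.
        with open(self.commit_log, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        # 2. Keep the value in memory for fast subsequent reads.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_at:
            self._flush()

    def read(self, key):
        # Serve from memory if possible; otherwise check the immutable files.
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.data_files):
            with open(path) as data_file:
                contents = json.load(data_file)
            if key in contents:
                return contents[key]
        return None

    def _flush(self):
        # 3. Write the memtable out as a new immutable file and start fresh.
        path = "/tmp/datafile-%d.json" % len(self.data_files)
        with open(path, "w") as data_file:
            json.dump(self.memtable, data_file)
        self.data_files.append(path)
        self.memtable = {}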

Finally, the physical structure in Listing 3-12 has another important consequence with respect to data access. Considering this key-value structure and the distributed nature of the storage system, it lends itself perfectly to the MapReduce data access mechanism described earlier. This is the primary reason this type of storage system is practically synonymous with MapReduce.

Following along the lines of this architecture used by BigTable, HBase and Hypertable, we find a similar storage system called Cassandra. Even though Cassandra is also considered a distributed column-row storage system, it has sufficient differences from these other storage systems that it can be considered separately.

Cassandra was originally designed at Facebook, though it has since evolved into a top-level Apache Software Foundation project. Facebook faced data mining problems similar to Google's: millions of users, each one having a particular group of friends and each friend having his own set of properties (e.g. email, photos, school). Solutions to scenarios like 'How do you allow users to search among their group of friends for a particular item?' or 'How do you execute these tasks so they scale to such a number of users and still work in fractions of a second?' all required taking an approach distinct from other storage systems, just as in Google's case.

The first major difference in Cassandra is that it's not built on top of a parallel fault-tolerant file system, yet it still provides the same set of features -- parallel data access and fault-tolerance. How can this be? Cassandra is a peer-to-peer (P2P) storage system, which means every node has an equal weight and can contribute to achieving parallel data access and fault-tolerance. In addition, this also means there are no master or tablet servers, not to mention that forgoing the set-up, configuration and management of a parallel fault-tolerant file system can make things much simpler. Both storage systems offer the same end results; they just take different design approaches to deliver them.

On the data layout front, Cassandra uses a similar table structure of row keys, column families and column qualifiers. However, Cassandra adds supercolumns to a table's structure, which adds another dimension to the data that can be stored. To put things in simpler terms, supercolumns are column families of column families.

The reading and writing of data in Cassandra occurs in a similar fashion to BigTable and similar implementations. The minimal difference is the lack of master and tablet servers, since Cassandra operates on a P2P architecture. Finally, another factor that makes Cassandra different is that it can be configured to offer trade-offs between consistency and latency.

Web servers

Web servers play an important role, attending and dispatching requests for all users accessing a web application. Compared to selecting web frameworks or permanent storage systems, web servers are often considered to play a secondary role in the design process of web applications. But when it comes to performance and scalability issues, web servers are just as important.

In order to understand why web servers can play such an important role, it's necessary to explore their inner workings and, while doing so, take a few steps back in time to the origins of web servers to fully understand why there are so many options to choose from.

Static pages

When the first Internet sites appeared the content served by web servers was static. Things like research papers were marked up in HTML documents making it easier for academics to exchange information. In such cases, a web server's job was straightforward. A web server read a static file from a hard drive and dispatched it to the requesting user.

Out of this need, web servers like Apache became synonymous with the web. Other alternatives soon blossomed to fulfill features left out by first-movers like Apache; among these alternatives you'll find web servers like Lighttpd, Nginx, Unicorn and LiteSpeed.

However, as the demand for features on web sites grew, what we now know as a web application came to fruition. No longer satisfied with static content, users started demanding dynamic content, and a web server's job grew in complexity. Such demands initially grew from users submitting data in forms, to interactive forums, to shopping catalogs, to the more sophisticated applications known today, like Customer Relationship Management systems (CRMs).

Adapting to these needs, a series of web servers started offering a mechanism called the Common Gateway Interface (CGI).

CGI, FastCGI and SCGI

CGI allowed web servers to generate dynamic content through the invocation of a program. So instead of a web server dispatching content from a static file, a web server would execute a type of program (i.e. a web application) to generate dynamic content and dispatch the results to the requesting user. However, this soon became an inefficient approach.

Why exactly was there a problem with a web server invoking a program? To illustrate this point, think of how you would generate content in a text editor on your workstation. In order to generate content, you need to start a text editor -- which is a program -- and then do the required tasks. Starting a text editor on your workstation takes up resources, as well as a few seconds. Now extrapolate this task of starting a text editor 1,000 times in an hour, which would be the equivalent of a web server generating dynamic content for 1,000 users in an hour. Pretty inefficient, right?

Well, this is exactly how web applications functioned on the first web servers operating with CGI. When a user requested dynamic content that was generated by an application written in a programming language like Perl, the web server actually started a Perl run-time (a.k.a interpreter) to generate the dynamic content, just as you would start a text editor. As you can imagine, having 10 users requesting the same content simultaneously also implied 10 Perl interpreters running simultaneously.
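
For reference, a CGI program can be as small as the following Python script; the key point is that the web server launches a brand new interpreter for every single request, runs the script, and sends whatever it prints back to the requesting user.

#!/usr/bin/env python
# Minimal CGI script: headers first, a blank line, then the response body.
print("Content-Type: text/html")
print("")
print("<html><body>Dynamically generated content</body></html>")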

Note Perl, CGI and web servers

I mention Perl in this last scenario because Perl was one of the first programming languages used in web applications. But the same principle applies to other programming languages and web servers operating under the principles of CGI: a Python application would invoke a Python interpreter, a Java application a Java process, etc.

Though CGI is still supported in today's web servers, its inefficiency has spawned various successors, among them FastCGI and the Simple Common Gateway Interface (SCGI). In principle, FastCGI and SCGI work in the same way. The web server delegates the execution of an application to a language run-time -- as in our Perl case -- and upon finishing, dispatches the results to the requesting user. FastCGI and SCGI differ from CGI in that both are capable of reusing a single run-time (i.e. process) on which to execute multiple user requests, along with a few other efficiency techniques.

The use of CGI, FastCGI or SCGI keeps the web server separate from the run-time or execution environment of a web application, whether it's an application written in Perl, Python, Ruby or any other environment. This is often seen as a good design principle, since the web server operates independently of an application's primary execution environment and both can be optimized separately.

However, even though FastCGI and SCGI represent advances over their predecessor CGI, this design principle for web servers is not without its critics. For this reason, another variation for executing web applications emerged, where a run-time is embedded in or linked to a web server.

Embedded and linked run-times -- Apache modules

Having the run-time on which a web application is executed embedded in or linked to a web server is seen by many as a step forward with respect to the various CGI approaches mentioned in the previous section. One of the web servers that popularized this approach is the Apache web server produced by the Apache Software Foundation.

Modules -- as Apache calls this approach -- are language specific libraries installed on the web server. Since these libraries are compiled against a particular programming language's run-time, this results in a web server being embedded or linked with a run-time, often resulting in better performance (see the sidebar 'Are embedded run-times (modules) better than FastCGI?' for more on this claim).

Apache has modules for a wide array of programming languages. A module in Apache uses the naming convention mod_<programming_language>. So for example, these are some of the modules you might encounter when running applications with Apache: mod_php (PHP), mod_python (Python), mod_ruby (Ruby), mod_jk (Java), and mod_mono and mod_aspdotnet (.NET).

All these modules are available in either binary format or source code. Generally, for production environments it's recommended that you get a module's source code and compile it on the target server so the module is optimized for the specific versions you will use.

In addition, it's also possible to compile the entire Apache web server with these modules, so that modules are statically compiled against the Apache web server binary and not loaded as Dynamically Shared Objects. Depending on the Apache version you use, many binaries will come with statically compiled modules, so you may not need to install any modules or compile your own version of Apache. For more on the advantages and disadvantages of using modules as Dynamically Shared Objects see: http://httpd.apache.org/docs/2.0/dso.html#advantages

Independently of the way you opt to use any of these modules with the Apache web server, all modules still delegate executing an application to the proper run-time environment.

So for example, when an Apache web server relies on the mod_php module, if it detects a .php extension, it will delegate the request to the PHP run-time linked to the module for processing. After the PHP run-time finishes, the Apache web server dispatches the contents out to the requesting user. A similar process occurs for other modules like mod_ruby, mod_python, mod_jk, mod_mono and mod_aspdotnet.

Now you may ask yourself, what's the difference between executing an application in a run-time using an Apache module and a run-time using something like FastCGI?

In the case of mod_php, mod_ruby and mod_python, a run-time (a.k.a. interpreter) for each language is embedded in the Apache web server's process. This in turn makes the processing of PHP, Ruby and Python web applications potentially faster, since the web server does not need to delegate execution to an external run-time, but can instead rely on the run-time embedded within Apache.

The emphasis on potentially faster execution is due to the fact that a web server process with an embedded run-time can become substantially larger. This is an entire run-time with all its library APIs -- whether or not the APIs are used by a web application. So if a run-time consists of 10 MiB, that's an extra 10 MiB of physical memory needed to run each web server process -- with 50 such processes, that's an extra 500 MiB -- an amount that can quickly add up.

This is why run-times (a.k.a. interpreters) used to back mod_php, mod_ruby and mod_python are highly optimized for this use. They're limited to library APIs common to web applications, as this reduces the overall memory footprint of the run-time and with it the web server process.

This is also why web servers need to be judiciously configured to use only those modules needed by an application. If a web server only dispatches Ruby applications, installing mod_php and mod_python unnecessarily embeds the corresponding run-times into a web server process, increasing the resources needed to run a web server. By the same token, if a web server only dispatches static pages, adding any type of module is unwarranted. Note that this last technique is called 'static content separation' and is described in detail in part II of the book, which covers performance and scalability techniques for an application's static tier.

In the case of mod_jk, mod_mono and mod_aspdotnet, these modules behave differently given the mechanics of Java and .NET web applications. As such, their respective environments are not embedded into Apache's process. Instead, mod_jk, mod_mono and mod_aspdotnet connect to each of these languages' web platforms via TCP to execute Java and .NET web applications, respectively.

This subtle difference between Apache modules deserves a closer examination.

In PHP, Ruby and Python, each language's run-time is general in nature. Which is to say, applications written for the web, graphical user interfaces or the command line are all executed in the same run-time. Such run-times simply require installing additional libraries to support different behaviours, such as web libraries to support a particular web framework. This makes embedding the run-time a natural step.

In Java and .NET though, in addition to having a base run-time used by all applications -- the Java Virtual Machine (JVM) and the Common Language Runtime (CLR) -- there is also a specific run-time for web applications. In the case of Java, a separate environment is used to execute Java Server Pages (JSPs) and Servlets, which are the components used in Java web applications. And in the case of .NET, a separate environment is used to execute ASP.NET and ADO.NET components, which form the foundations for .NET web applications.

The difference for Java and .NET is that these dedicated web environments can actually fulfill the duties of a web server by attending user requests. This makes embedding such run-times impractical. Instead, Apache modules communicate via TCP with these environments -- or language specific web platforms -- as an integration facility.

Next, I will describe these language specific web platforms that can fulfill the work of web servers, and how Java and .NET address this approach, including similar initiatives that have emerged for PHP, Ruby and Python.

It's worth mentioning that Apache is not the only web server offering a module approach aimed at performance enhancements over CGI* approaches. The LiteSpeed web server offers a similar approach to Apache modules, designed to increase its performance with respect to CGI* approaches, called the LiteSpeed Server Application Programming Interface (LSAPI). The PHP and Ruby implementations are the best-known LSAPI implementations.

Note Are embedded run-times (modules) better than FastCGI?

Using an embedded run-time in a web server process vs. relying on a run-time outside the web server process -- even for cases using FastCGI where a single run-time is shared -- is a highly debated topic.

The issue with embedding a run-time like PHP, Ruby or Python in a web server process is that the entire run-time (a.k.a. interpreter) is embedded in the web server process. This means that if a run-time has libraries for things like image rendering, 3D animations or language dictionaries that have nothing to do with supporting a web application, these libraries will still take up resources (memory) in the web server's process.

If a single web server process is used with an embedded run-time, this is not cause for much debate. The problem arises when a web server spawns different processes to attend incoming requests as load increases, all of which will have an embedded run-time.

On the other hand, since FastCGI relies on a run-time outside the web server process, spawning multiple web server processes is a non-issue, since the size of each process is limited to the web server itself. However, this doesn't mean a mechanism like FastCGI is exempt from running multiple run-times. Much like the embedded run-time approach, FastCGI often requires multiple run-time processes to attend increasing demand, so the consumption of resources (memory) can be in a similar range.

As you can see, both approaches have benefits that can be offset by particular circumstances; which option is more efficient will depend on your requirements. In fact, other potentially more efficient approaches than embedded run-times or FastCGI have been devised for the more popular programming languages used in web applications, which is the topic of the next section.

Application servers -- Language specific web platforms

Application servers were initially introduced with the advent of Java into web applications. The architects behind Java devised a distinctive model in which to develop web applications, both in terms of development and run-time.

For structuring web applications, Java's designers created Java Server Pages (JSPs) and Servlets, in addition to Enterprise Java Beans (EJBs), which served the former two as a means to access permanent storage systems and support more elaborate business processes. In order to execute all of these components, a special environment coined the 'application server' was created.

An application server in Java consists of two parts: a web container used as the run-time for JSPs and Servlets, and an EJB container used as the run-time for EJBs. More recently, it has become common for Java web applications to forgo the use of EJBs altogether and, as a consequence, run on a standalone web container -- which, though technically not an application server by definition, is still part of one.

A web container offers a run-time environment that streamlines the deployment of Java web applications. Among its capabilities are: the capacity to attend user requests directly -- just like a web server -- the ability to manage connections to permanent storage systems, the ability to define security policies for Java web applications, and other particularities unique to running JSPs and Servlets.

Though web containers still rely on Java's core run-time, the Java Virtual Machine (JVM), they set a precedent -- along with Java application servers -- as the first self-contained environments on which to run web applications. This is unlike the Apache web server, which was initially designed to serve static content and was extended to support web applications (e.g. via CGI* approaches or embedded run-times (modules)).

Later came Internet Information Server (IIS), produced by Microsoft. This environment was similarly streamlined to deploy web applications designed around Microsoft technology, initially in the form of Active Server Pages (ASP) and more recently with the introduction of .NET, ASP.NET and ADO.NET technology. In principle, IIS serves the same purpose as web containers: a self-contained environment in which to run web applications.

This pattern in Java and .NET is one of building a server around the particularities of a programming language -- including any ancillary support needed to create web applications in such a language -- instead of adapting a web server like Apache to execute applications. Hence the dropping of the term 'web server' in favor of 'application server' or 'web platform'.

Such an approach of building a self-contained environment in which to run web applications was seen in a good light by other programming language designers, and so a series of application servers or web platforms have emerged for other languages.

Continuing along these lines you can find Zend Server, an application server designed for PHP. It offers enhancements in the form of PHP bytecode caching, out-of-the-box connectivity to many permanent storage systems, as well as other facilities that make the deployment of PHP web applications easier, often with better performance and scalability results than if you were using a web server like Apache embedded (mod_php) or linked (CGI*) with a PHP run-time.

Following a similar pattern, Ruby developers have at their disposal Mongrel. Though not called outright an 'application server for Ruby', Mongrel's design as a server initially relied on the Ruby run-time's threading system. In this sense, it was initially and specifically designed to take advantage of features present in Ruby's run-time that would otherwise go unused by embedding Ruby's run-time in a web server like Apache, making Mongrel the closest thing to an application server for Ruby. I emphasize initially because, since its inception, Mongrel has evolved into a language agnostic web server, which now supports not only Ruby but also other languages like PHP, Haskell and .NET.

Next, we come to Python. Though this programming language has had an application server by the name of Zope (Z Object Publishing Environment) for quite some time, unlike Mongrel for Ruby -- which is capable of running any Ruby code -- or Zend Server -- which is not pegged to a particular PHP web framework -- the Zope application server is designed with the Zope web framework in mind.

Given this fact, it's not possible to run ordinary Python code or a Python web framework like Django in Zope. Instead, the Python community -- including Django -- has settled on the Python specification called WSGI (Web Server Gateway Interface), a standardized interface between web servers and Python web frameworks/applications.

Being a standard developed closely alongside Python's run-time, WSGI makes web servers that support it the closest thing to an application server for Python. Many standalone web servers -- including Apache -- support this standard, which is also the choice many web frameworks recommend over the previously mentioned CGI* approaches or the embedded run-time (mod_python).
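
To make the WSGI idea concrete, here's the canonical minimal WSGI application -- a plain Python callable that any WSGI-compliant server can host; the snippet below runs it on Python's built-in reference server purely for illustration.

from wsgiref.simple_server import make_server

def application(environ, start_response):
    # The server calls this function once per request: 'environ' carries the
    # request data, 'start_response' sends the status line and headers.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello from a WSGI application"]

if __name__ == "__main__":
    # Python's built-in reference server, handy for local testing.
    make_server("", 8000, application).serve_forever()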

Note Are application servers better than web servers with embedded run-times?

Much like the previous sidebar comparing embedded run-times to FastCGI, the answer depends on your circumstances. Though language specific application servers are optimized to execute applications in a particular programming language, they can also present a series of drawbacks, some of which are not necessarily technical.

A web server like Apache and its two approaches to running web applications -- modules and CGI/FastCGI -- is unlikely to lose appeal over time. For one, Apache is one of the most widely used servers, making it not only among the most stable in terms of fewer bugs and fewer security vulnerabilities, but also what most administrators have learned to deploy.

Language specific application servers, while designed for better performance and scalability, require a greater commitment from application developers and the organizations using them. On the one hand, most are likely to need special setup procedures. In addition, most if not all language specific application servers are backed by a commercial entity.

On the technical front, you also need to realize application servers can be vulnerable to Internet security attacks, since they perform the operations of a web server and are exposed to the public at large. The vulnerability discovery process in application servers can be vital to guaranteeing application integrity. A web server like Apache with an embedded run-time is less likely to be subject to such security attacks, given Apache's longer history and ample user base. This potential security risk is why you'll often see Apache web servers running as a proxy in front of application servers, so the application servers don't face the wider Internet directly.

Much like the previous sidebar, a series of factors beyond performance and scalability will determine whether a language specific web platform is the better choice for your projects.
