While only a few desktop applications take advantage of recent 64-bit processors and technologies, the industry’s demand for high-performance processing platforms has increased continuously over the last decade. Because many issues must be solved, e.g. binary compatibility and simultaneous execution of 32-bit and 64-bit code, the transition from 32-bit to 64-bit computing isn’t as trivial as it may seem at first glance. In the absence of a formal standard, the industry has silently agreed on a well-known data model for 64-bit computing that seemed to be a natural choice. This article explains why the so-called LP64 data model is superior to the others and what considerations led to its selection.
Differences Between Processors
Good news for consumers is often bad news for system developers. A new processor architecture or a completely new programming model is usually of great benefit to users, but a successful transition typically requires a large amount of work from developers. They face problems that are tied directly to the processor design, such as storage requirements (e.g. data alignment and byte order). In this section we take a brief look at some crucial issues that arise when moving to another processor or computing architecture.
Alignment of Data
Almost every processor prefers or requires a certain alignment of data in memory: when accessing n bytes of data, the starting address must be a multiple of n. Unfortunately, processors differ in how strictly they enforce this. Intel’s x86 architecture allows unaligned access, but at the cost of a performance penalty. RISC processors, on the other hand, usually do not allow unaligned memory access. They respond to such an operation with a processor fault, which is either handled by a software trap or simply crashes the program. (When unaligned access is emulated entirely in software, it is extremely slow.)
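The alignment rule can be checked, and unaligned reads avoided, with a few lines of C. The following is only a sketch: the helper names is_aligned and load_u32 are made up for this illustration, and it assumes that converting a pointer to uintptr_t yields the numeric address, which holds on mainstream platforms but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the address p is a multiple of n, 0 otherwise (n must be > 0). */
int is_aligned(const void *p, size_t n)
{
    assert(n > 0);
    return ((uintptr_t)p % n) == 0;
}

/* Reads a 32-bit value from any address. Copying the bytes with memcpy()
   avoids dereferencing a misaligned uint32_t pointer, so the compiler can
   emit an access sequence that is legal on the target processor. */
uint32_t load_u32(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Code that parses packed file or network buffers can call load_u32(buf + offset) instead of casting buf + offset to a uint32_t pointer, which may fault on processors that enforce alignment.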
Byte Order
Two representations of multi-byte data are in use, distinguished by the order of the bytes within a data type: little endian and big endian. Little-endian architectures (like Intel’s x86) place the least significant byte (LSB) at the lowest address, while big-endian architectures (e.g. SPARC and PowerPC) place the most significant byte (MSB) first. Normally the byte order (also called endianness) is fixed by the processor’s designers, but some processors, such as SGI’s MIPS, support both representations. This is known as bi-endianness. As a historical oddity, the PDP-11 stored 32-bit values as two little-endian 16-bit words with the most significant word at the lower address.
Host Byte Order
To find out which byte order your processor uses, you can write a simple C program that detects it and prints the result:
    #include <stdio.h>

    int is_little_endian(void);

    int main(void)
    {
        printf("%s\n", (is_little_endian() ? "Little endian" : "Big endian"));
        return (0);
    }

    int is_little_endian(void)
    {
        /* Look at the byte stored at the lowest address of a 16-bit value. */
        short int word = 0x0001;
        char *byte = (char *)&word;
        return (byte[0] ? 1 : 0);
    }
Portability is a common problem for programmers, and exchanging data between architectures with different byte orders is a typical case. The following code snippet illustrates a common but bad practice:
    #include <stdio.h>

    void write_long(FILE *f, long l)
    {
        /* Oops! Storing in byte order of current processor. */
        fwrite(&l, sizeof(l), 1, f);
    }

    long read_long(FILE *fp)
    {
        long l = 0;
        /* Oops! Assuming same byte order as written to disk. */
        fread(&l, sizeof(l), 1, fp);
        return (l);
    }
Of course, it is perfectly legal to write code like this, but the resulting files can only be read on platforms with the same byte order (and the same size of long) as the one that wrote them. A very popular way to overcome this limitation is to give the binary file a header that contains enough information for applications to discover the byte order of the file. The application must then convert the bytes into its own byte order if necessary.
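One way to realize such a header is to start the file with a known magic number written in the byte order of the writer. A reader that sees the magic number byte-swapped knows that every following value must be swapped too. The sketch below illustrates the idea for 32-bit values. FILE_MAGIC and the function names are hypothetical and not part of any real file format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical magic number; any value that is not a byte-wise palindrome works. */
#define FILE_MAGIC 0x1A2B3C4Du

/* Reverses the order of the four bytes of v. */
uint32_t swap_u32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Inspects the magic number as read from the file header.
   Returns 0 if the file uses the host byte order, 1 if its values must be
   swapped, and -1 if the header is not recognized. */
int needs_swap(uint32_t magic_as_read)
{
    if (magic_as_read == FILE_MAGIC)
        return 0;
    if (magic_as_read == swap_u32(FILE_MAGIC))
        return 1;
    return -1;
}

/* Converts a raw value from the file to host byte order. */
uint32_t to_host_u32(uint32_t raw, int swap)
{
    assert(swap == 0 || swap == 1);
    return swap ? swap_u32(raw) : raw;
}
```

A reader would call needs_swap() once on the first four bytes of the file and then pass its result to to_host_u32() for every value that follows.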
Network Byte Order
Besides the byte order of processors, there is a fixed byte order for network communication. The TCP/IP protocol suite requires big-endian byte order for the fields in its protocol headers. This order is imposed by the protocols themselves and says nothing about the application data being transmitted. A very common example is the internal representation of an IPv4 address and the port associated with a socket. Both are members of the sockaddr_in structure and must be stored in network byte order. Inexperienced programmers often write code like this:
    struct sockaddr_in server;
    server.sin_port = 80; /* non-portable */
On little-endian architectures this will not behave as expected, although everything looks correct. The sockets API provides functions that convert to network byte order. Even if we know the byte order of the machine we are working on, we should use these functions to maximize the portability of our code. The correct version of the previous code snippet looks like this:
    struct sockaddr_in server;
    server.sin_port = htons(80); /* this is portable */
Here the function htons (host to network short) converts a 16-bit value to network byte order. The reverse is achieved with ntohs (network to host short), which converts to the byte order of the host system.
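To see what htons actually does, it helps to look at the bytes it produces. The sketch below assumes a POSIX system, where these functions are declared in <arpa/inet.h>. The helper name port_wire_bytes is made up for this illustration.

```c
#include <arpa/inet.h> /* htons, ntohs, htonl, ntohl (POSIX) */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stores the bytes of host_port, as they appear in memory after conversion
   to network byte order, in out[0] and out[1]. For port 80 this yields
   0x00 followed by 0x50 on every host, whatever its own byte order. */
void port_wire_bytes(uint16_t host_port, unsigned char out[2])
{
    uint16_t net = htons(host_port);
    assert(out != NULL);
    memcpy(out, &net, sizeof net);
}
```

The 32-bit counterparts htonl and ntohl work the same way and are typically used for IPv4 addresses.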
Size of Native Types
The size of a processor’s registers determines its natural word size. For many 32-bit systems the natural word size is four bytes (sizeof(int) == 4), although Intel’s x86 can switch to two bytes to retain backward compatibility.
With the advent of 64-bit processors it became clear that this assumption no longer holds. What is the natural word size of this processor generation? Still four bytes, or eight bytes? Or both? As a result, several data models have been developed with different goals in mind. Some of them emphasize portability, others compatibility with 32-bit processors. Until now, five models have been introduced, two of which are used for 32-bit processing: LP64, ILP64, LLP64, ILP32, and LP32. The next section describes each model and explains why LP64 is the model of choice.
Data Models
A data model specifies the sizes of the basic data types on a given platform. It is usually defined by the operating system and its compilers rather than by the hardware alone, so multiple data models can be used on the same processor. The following subsections describe the most popular data models used in 32-bit and 64-bit computing environments.
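Which data model a given compiler targets can be determined from the sizes of int, long and pointers. The following sketch classifies the five models named in this article. The enum and function names are hypothetical, and any combination of sizes outside those five models is reported as unknown.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum data_model {
    MODEL_UNKNOWN, MODEL_LP32, MODEL_ILP32, MODEL_LLP64, MODEL_LP64, MODEL_ILP64
};

/* Maps the widths (in bits) of int, long and pointers to a data model. */
enum data_model classify_model(size_t int_bits, size_t long_bits, size_t ptr_bits)
{
    if (int_bits == 16 && long_bits == 32 && ptr_bits == 32) return MODEL_LP32;
    if (int_bits == 32 && long_bits == 32 && ptr_bits == 32) return MODEL_ILP32;
    if (int_bits == 32 && long_bits == 32 && ptr_bits == 64) return MODEL_LLP64;
    if (int_bits == 32 && long_bits == 64 && ptr_bits == 64) return MODEL_LP64;
    if (int_bits == 64 && long_bits == 64 && ptr_bits == 64) return MODEL_ILP64;
    return MODEL_UNKNOWN;
}

/* Data model of the compiler and target this code is built for. */
enum data_model host_model(void)
{
    return classify_model(sizeof(int) * CHAR_BIT,
                          sizeof(long) * CHAR_BIT,
                          sizeof(void *) * CHAR_BIT);
}

/* Human-readable name of a data model. */
const char *model_name(enum data_model m)
{
    static const char *const names[] = {
        "unknown", "LP32", "ILP32", "LLP64", "LP64", "ILP64"
    };
    assert((size_t)m < sizeof names / sizeof names[0]);
    return names[m];
}
```

Printing model_name(host_model()) in a small test program shows the data model in effect for the current build.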
ILP32 and LP32
Today’s programmers are familiar with the traditional 32-bit data model ILP32, in which integers, longs and pointers are 32 bits in size. LP32 is an even simpler specification designed for Intel’s processors to provide backward compatibility with the 8086 processor family. Here, only longs and pointers are 32 bits in size, while integers are 16 bits.
Obviously, ILP32 lacks true 64-bit data types and is therefore inappropriate for 64-bit processors, which can address amounts of memory far beyond the 4 GB boundary of 32-bit systems.
The following table summarizes the sizes of the data types, in bits, for both ILP32 and LP32:
| Datatype | ILP32 | LP32 |
|---|---|---|
| char | 8 | 8 |
| short | 16 | 16 |
| int | 32 | 16 |
| long | 32 | 32 |
| pointer | 32 | 32 |
LP64, ILP64 and LLP64
One of the key benefits of 64-bit processors is the larger address space. Vendors are now able to develop software systems that address up to 512 terabytes of storage without having to work around the 4 GB barrier imposed by 32-bit processors. This capability is very significant for new high-performance applications such as database systems or full-motion video. Together with falling prices for memory and storage and continuously increasing computing power, this has led to new data models that address these requirements.
Technical Background
Adding new data types to the C programming language is not possible per se. With the introduction of 64-bit addressing and new arithmetic capabilities, application developers are forced to change mappings or bindings of existing types or to add new data types to the language.
The most popular data models are LP64, ILP64 and LLP64. The names indicate which basic types are 64 bits wide: long and pointer in LP64; int, long and pointer in ILP64; long long and pointer in LLP64. The sizes, in bits, of each type in each data model are summarized in the following table:
| Datatype | LP64 | ILP64 | LLP64 | ILP32 | LP32 |
|---|---|---|---|---|---|
| char | 8 | 8 | 8 | 8 | 8 |
| short | 16 | 16 | 16 | 16 | 16 |
| int | 32 | 64 | 32 | 32 | 16 |
| long | 64 | 64 | 32 | 32 | 32 |
| long long | 64 | 64 | 64 | | |
| pointer | 64 | 64 | 64 | 32 | 32 |
LLP64 Data Model
As the table above shows, LLP64 preserves the relationship between long and int by specifying a size of 32 bits for both. Data objects that do not contain pointers therefore have the same size as in the ILP32 data model found in almost every 32-bit computing environment.
To support 64-bit scalar values, a new, non-portable data type (long long) has been introduced. One can say that LLP64 is really a 32-bit data model working with 64-bit addresses. This hides potential runtime problems wherever code makes assumptions about the size of data types: unlike in a real 32-bit data model such as ILP32, a pointer no longer fits into an int. To overcome this problem, int and long variables that hold addresses must be changed to long long in the source code.
This leads directly to another problem with the system APIs: either the data type definitions of the APIs must change, or a new set of 64-bit interfaces must be introduced. As we can see, LLP64 is far from optimal, since it forces major changes to API specifications even where 64-bit wide types are not naturally required.
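The truncation problem is easy to demonstrate without a 64-bit pointer at hand by modelling the address as a 64-bit integer. The sketch below is illustrative only, and all function names are made up. It also shows uintptr_t from <stdint.h>, a C99 addition that postdates the debate described here and that implementations may provide for holding pointer values in an integer.

```c
#include <assert.h>
#include <stdint.h>

/* Models what happens when a 64-bit address is stored in a 32-bit integer,
   as with int or long under LLP64: the upper 32 bits are silently lost. */
uint32_t truncate_to_32(uint64_t address)
{
    return (uint32_t)address;
}

/* Narrowing conversion that refuses to lose information. */
uint32_t checked_narrow(uint64_t address)
{
    assert(address <= UINT32_MAX);
    return (uint32_t)address;
}

/* Round-trips a pointer through uintptr_t instead of int or long. */
void *round_trip(void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (void *)bits;
}
```

With the address 0x00007FFF12345678, truncate_to_32 returns 0x12345678, which no longer identifies the original object.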
ILP64 Data Model
In contrast to LLP64, the ILP64 data model tries to preserve the relationship between all three basic types that ILP32 developers are used to, by making int, long and pointer types the same size. Converting or assigning pointers to int or long does not truncate data. The major disadvantage of this model is that it either is non-portable or requires the addition of 32-bit data types, int32 for example. This would clash with existing typedefs and does not reflect the fundamental spirit of C, which has always avoided building size descriptions into the basic types. Also, system software that depends on data alignment and size may be rendered non-portable, since it is forced to introduce non-standard data types.
Like LLP64, ILP64 requires frequent changes to source code to allow data to be interchanged between 32-bit and 64-bit computing environments. Introducing non-standard data types for the sake of interoperability and binary compatibility would be an option, but it conflicts with the industry’s basic demand for portability.
LP64 Data Model
The last data model is LP64. It takes the best of both worlds by preserving the sizes of char, short and int as known from ILP32. System software relying on data alignment and size is unaffected and can still be used on 32-bit systems without problems. Additionally, a real 64-bit data type (long) is available to take full advantage of the arithmetic capabilities, especially with regard to pointer arithmetic. For this reason, programs that are made 64-bit clean by changing the type used to hold addresses from int to long can be recompiled and run on 32-bit computing environments without the awkward changes described for LLP64 and ILP64.
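Both properties can be sketched in a few lines. The structure below is a hypothetical record that uses only char, short and int, so with the usual alignment rules its layout is the same under ILP32 and LP64. The function address_as_long follows the int-to-long change described above and checks at run time that long is wide enough, which is the case under ILP32 and LP64 but not under LLP64.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record built only from types whose sizes LP64 leaves unchanged. */
struct record {
    char  tag;    /* 8 bits */
    short flags;  /* 16 bits */
    int   value;  /* 32 bits */
};

/* Holds an address in a long, as 64-bit clean LP64 code does. The assertion
   fails on data models in which long is narrower than a pointer. */
long address_as_long(const void *p)
{
    assert(sizeof(long) >= sizeof(void *));
    return (long)p;
}
```

On typical ILP32 and LP64 platforms sizeof(struct record) is 8 and the member value sits at offset 4, so binary data described by this structure can be shared between 32-bit and 64-bit builds that use the same byte order.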
Summary
To make meaningful statements about the superiority of a particular model, one must perform thorough investigations and evaluate the results, which is far beyond my capabilities in terms of hardware equipment. But from the comments above and the rationale of each data model, we can conclude that LP64 is the model of choice for the following reasons:
- Portability is maximized through LP64, and common problems associated with this criterion can be detected automatically.
- Interoperability is enhanced by the ability to use standard data types and structures that work in both 32-bit and 64-bit environments.
- The transition from one computing environment to the other is easy and has proven to be smooth in real-world projects.
Besides all that, the natural use of C data types to accommodate all the widths needed in a 64-bit environment is the strongest argument for the proliferation of the LP64 data model.
graegerts