Computer Design: A computer aided design and VLSI approach
Paul J. Drongowski

Chapter 2 - Computer architecture.

In the early 1960's, computer vendors recognized that software compatibility had become a crucial issue for the user community. Customers had invested considerable sums of money in application programs only to find that the new machines had incompatible instruction sets and their investment was lost. (Remember that much of the software of that era was written in assembly language.) The designers of the IBM 360 sought to alleviate the compatibility problem by defining and developing a whole series of machines with the same instruction set, but varying levels of performance. Thus, the notion of a computer family or architecture was born.

The contemporary idea of computer architecture has expanded beyond the definition of common instruction sets to include standard communication protocols, resource management schemes and even languages. We will discuss these aspects of computer architecture in this chapter. How a machine is structured and built to realize a particular architecture is called "computer organization," the primary topic of this book.

Section 1 - Instruction set architecture.

The instruction set architecture (ISA) is a "programmer's eye view" of the operation of a machine. All members of an architectural family must operate in accordance with the abstract specification of the ISA to guarantee software compatibility at the machine code level. The ISA consists of a description of the registers, instructions (operations and addressing modes), and special programming features for input/output (I/O), memory and processor management.

The registers provide a relatively small amount of fast memory for scratchpad computations and control. They are used to store information that has a short lifetime or whose use is very frequent or urgent. The programmer ordinarily has one or more registers available for direct use by the application program. (The exception is a stack architecture in which all intermediate results are carried in a push-down stack.) Other registers are used primarily for control, and access to them may be limited or restricted in some way. The PDP-11 architecture, for example, provides eight 16-bit general registers. However, two of the registers are special purpose: the stack pointer and the program counter. The other six registers are truly general purpose and may be used for arithmetic and logical computations, memory addressing (indirect and indexed memory access), and subroutine linkage.

The instruction set itself is often separated into operations and addressing modes. It is not unusual to find a common set of memory access modes which may be used in a regular manner across a large subset of the operations. Machines that employ this concept are said to be "orthogonal." Conceivably, the operations and addressing modes could be arranged along two orthogonal axes, and any combination of operation and addressing mode in that space would be meaningful. It is generally easier to construct a compiler for an orthogonal machine as fewer peculiarities or special cases must be handled when generating code.

Architectures may be classified by the number of addresses permitted within a single instruction. The common types of machines are listed below.

Zero address. Stack architectures are zero address machines. Operands and results are maintained on a push-down stack. To perform an addition, for example, the two operands are popped from the stack and added together. The sum is then pushed onto the stack where it will await the next operation. The locations of the operands are implicit in the instruction (the top of the stack) and hence, no explicit address is necessary. Compiler code generation for a stack machine is easy because the code for expression evaluation is identical to the reverse Polish (postfix) form of the expression.
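To make the zero address idea concrete, here is a minimal sketch in C of a stack machine evaluating the postfix form of (3 + 4) * 2. It is an illustration, not an example from the text; the opcode names, instruction format and fixed stack depth are all invented for this sketch.

    #include <stdio.h>

    /* A tiny zero address (stack) machine: all operand locations
       are implicit in the instructions. */
    enum op { PUSH, ADD, MUL, HALT };

    struct insn { enum op op; int value; };   /* value used only by PUSH */

    int main(void)
    {
        /* Postfix form of (3 + 4) * 2 is: 3 4 + 2 * */
        struct insn program[] = {
            { PUSH, 3 }, { PUSH, 4 }, { ADD, 0 },
            { PUSH, 2 }, { MUL, 0 }, { HALT, 0 }
        };
        int stack[16], sp = 0;                /* push-down stack */

        for (int pc = 0; ; pc++) {
            struct insn i = program[pc];
            switch (i.op) {
            case PUSH: stack[sp++] = i.value;            break;
            case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
            case MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
            case HALT: printf("result = %d\n", stack[sp - 1]); return 0;
            }
        }
    }

Note that no instruction names a source or destination address; the top of the stack is always implied, which is exactly why code generation from the postfix form of an expression is so direct.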
One address. One address architectures are sometimes called "load and store" machines. Each operand is loaded from primary memory into a separate register using a one address load instruction. The operation is performed and a one address store instruction moves the result to memory. As an alternative, dyadic operations may be performed on a register operand and a single memory value with the result stored in the register.

Two address. In a two address machine, both operands are explicitly identified. The result is always stored to the location of the second operand. The double operand instructions of the PDP-11 are a good example.

Three address. Three address instructions specify the locations of the two operands and the destination of the result. Code generation for a three address machine is also straightforward since operands may be easily selected from, and results written to, any address in primary memory.

The choice of a zero, one, two or three address architecture will depend upon the ease and efficiency of compiling, program size and memory utilization requirements, and the desired frequency of relatively slow primary memory accesses.

At least eight kinds of memory addressing modes are in common use. General register architectures that use a file (vector) of word registers have a register (or register to register) mode for selecting a register operand from the file. In direct or absolute addressing, the identity of a memory location is part of the instruction itself. An indirect address specifies a register or memory location which contains the absolute address of the operand or destination. Constants can be introduced into a program through "immediate" mode operands. Here the constant value is stored with the instruction and is used during the execution of the instruction. Indexed addressing is used to access the elements of a vector or array in memory. The base address of the array is stored explicitly as part of the instruction and is added to an index value from a general register. A special case of indexed addressing is "program counter relative" addressing. In this case, the program counter register is added to an offset stored with the instruction. Code which uses program counter relative addressing exclusively is easily relocated (shifted about) in memory. An index value may be automatically increased or decreased by a constant (one or two in the PDP-11) as a side effect of the instruction. This is sometimes called "auto increment" or "auto decrement" addressing. The increment or decrement operation may either precede or follow the actual access to the memory location for "pre" and "post" increment (decrement) addressing, respectively. The PDP-11 uses pre decrement and post increment register indirect addressing. This naturally yields stack push and pop operations when applied to a stack pointer.
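Each of these addressing modes has a rough analogue in C pointer and array notation. The sketch below is illustrative only (the little "memory," register variables and addresses are invented); it mimics immediate, direct, indirect, indexed and auto increment/decrement access.

    #include <stdio.h>

    int main(void)
    {
        int mem[8] = { 3, 20, 30, 40, 50, 60, 70, 80 };  /* primary memory   */
        int index = 2;                                   /* index register   */
        int sp = 8;                                      /* stack pointer    */
        int r;                                           /* general register */

        r = 42;             /* immediate: the constant travels with the instruction */
        r = mem[5];         /* direct: address 5 is part of the instruction         */
        r = mem[mem[0]];    /* indirect: location 0 holds the operand's address     */
        r = mem[4 + index]; /* indexed: base address 4 plus an index register       */

        /* Pre decrement and post increment register indirect addressing
           applied to a stack pointer give push and pop, as on the PDP-11. */
        mem[--sp] = r;      /* push: decrement, then store */
        r = mem[sp++];      /* pop: load, then increment   */

        printf("r = %d\n", r);
        return 0;
    }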
Instruction set operations include such arithmetic and logical operations as addition, subtraction, two's and one's complement (logical NOT), bit-wise AND, bit-wise OR, and shifts (logical, arithmetic and rotating.)

Machines which are tailored for a particular market segment often include specialized instructions that increase the speed of important customer applications in that segment. For example, an architecture for the scientific market might have an instruction to evaluate polynomials, an operation that occurs during the computation of sines, cosines and other transcendental functions. A business machine might have instructions for editing character strings. As a final example, queue and process management instructions could be built into a computer for the real-time processing market where the scheduling of activities is important and frequent.

The increasing complexity of machine instruction sets created a backlash in the early 1980's. A return to "reduced instruction set computers" (RISC) was proposed with the intent of increasing processor speed through the reduction of hardware complexity. The early machines of von Neumann and others were RISC's since the cost and unreliability of active vacuum tube circuitry in the 1940's and 50's dictated the minimization of hardware and instruction set size. RISC's employ very elementary instructions that can be evaluated during one clock period. Increased speed is gained by pipelined parallelism, simple datapath circuitry and a short clock period. As a side benefit, the functional density of a very large scale integrated (VLSI) RISC is substantially increased. Complex instruction set computers (CISC) use large control stores to retain the low level (micro) programming needed to implement a large set of complicated (macro) instructions. RISC proponents argue that most CISC instructions are under (or never) utilized and that large control stores are fundamentally incompatible with VLSI circuit technology.

As virtually all interesting programs make decisions and have one or more loops, instructions that alter program flow are available. Conditional goto's test the current state of the registers or condition flags and switch the execution flow to a new location. Unconditional goto's merely force instruction selection to a new address. Common or frequently used instruction sequences may be stored and called as subroutines. The instruction set usually includes an instruction to call a subroutine by saving the current value of the program counter register and jumping to the subroutine entry location. When the subroutine completes its work, it executes a return instruction which restores the return address to the program counter register. The use of programming languages with complex subroutine calling sequences has led to specialized call instructions in complex instruction set computers. The VAX CALLS instruction, for example, builds an entire call frame on the execution stack.

As we will discuss in Section 3, an architecture often includes features for the management of computing resources. The ISA must provide instructions to assist the resource management task. One such task, input/output between primary memory and secondary storage devices, may be accommodated in one of two ways: through programmed or memory-mapped I/O. With programmed I/O, special instructions are provided to control and communicate with I/O devices over the I/O bus. These instructions send and receive data and control information, initiate device operations, and sense error and device completion signals. In memory-mapped I/O, devices are manipulated through special registers which are accessible as standard primary memory locations. No specialized instructions are required as the standard memory access instructions may be used for device control and communication.
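The following C fragment sketches how memory-mapped I/O looks to a programmer. It is a hedged illustration, not a real device driver: the base address, register layout and status bit are invented. The point is that ordinary loads and stores, expressed here through a volatile pointer, are sufficient to command the device.

    #include <stdint.h>

    /* Hypothetical memory-mapped device registers; the address and
       bit assignments below are invented for illustration. */
    #define DEV_BASE 0xFFFF0000u

    typedef struct {
        volatile uint16_t control;  /* write 1 to start an operation     */
        volatile uint16_t status;   /* bit 0 set when the device is done */
        volatile uint16_t data;     /* data in/out register              */
    } dev_regs;

    #define DEV ((dev_regs *) DEV_BASE)

    /* Write one word to the device using only ordinary memory accesses. */
    void dev_put(uint16_t word)
    {
        DEV->data = word;               /* store the datum            */
        DEV->control = 1;               /* start the device           */
        while ((DEV->status & 1) == 0)  /* busy-wait until completion */
            ;
    }

With programmed I/O, the stores and loads above would instead be special I/O instructions addressing the device over the I/O bus.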
Section 2 - Communications.

As computer vendors began to use a common ISA across members of a product family, they came to recognize the importance of well-defined, standard interfaces. If the I/O bus can be standardized, for example, peripherals can be shared between family members. Further, the peripherals do not need to be discarded when upgrading to a faster processing unit if the new processing unit is I/O bus compatible with the old unit. This protects a customer's investment in peripheral equipment and makes a product family more attractive.

Busses may be proprietary or they may be an industry standard. Proprietary busses may protect the technological advantage of a vendor by hiding details about the operation of the bus. The use of a proprietary bus may also limit the size and number of "third party" vendors who provide an alternative source of equipment for a vendor's product line. Sometimes a proprietary bus becomes popular enough to be standardized. For example, the Shugart Associates System Interface (SASI) is the basis for the Small Computer System Interface (SCSI) for personal computer equipment. The Intel Multibus became IEEE standard 796.

Section 3 - Resource management.

Operating systems have also had a substantial impact on computer architecture. The primary function performed by an operating system is resource management -- the allocation and control of computing resources. Three kinds of physical resources are central to any computing system: the processor, primary memory and the I/O devices (including secondary storage.) It is the job of the operating system to transform these physical resources into programs (processes or activities), files and virtual I/O channels.

The processor is the locus of all computational activity in the system. In a modern multiprogramming system where a number of user jobs share a computing resource, the operating system must schedule those jobs for execution. Scheduling is usually broken into two levels: strategic resource management and physical scheduling. Strategic level scheduling uses process or job queues to arrange user jobs by priority or need. The physical scheduler actually moves a job onto the physical processor for execution in accordance with the strategic policy. Moving idle or waiting jobs off and ready-to-run jobs onto the processor is a time consuming task. Special instructions for queue management and physical scheduling ("context switching") can be provided to speed this task.

Primary memory holds the program code and data during and between executions on the processor. To accommodate large programs and to better utilize the primary store, a scheme called "virtual memory" is employed. In virtual memory, only the portion of a program which is essential to its execution, its "working set," needs to be in primary memory for execution. Thus, many more jobs may be in memory waiting to run and the total memory requirements of a job may exceed the physical size of the primary store. When an essential piece of the program is found missing (a "memory fault"), the operating system must bring that piece in from secondary storage. If another job is ready to run, it may be physically scheduled onto the processor while the missing piece is brought in.

Virtual memory requires a means to partition a program into "pieces." The two most common schemes are "paging," which uses fixed size pieces called "pages," and "segmentation," which uses variable size pieces called "segments." The architecture must define the procedure for mapping virtual program addresses into physical memory addresses such that a program may be relocated anywhere in memory.
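To show what such a mapping procedure might look like, here is a short C sketch of paged address translation. The 4 KB page size, the toy page table and the absence of valid and protection bits are simplifying assumptions made for the example.

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12                     /* assume 4 KB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)

    /* Toy page table: entry i maps virtual page i to a physical frame.
       A real table would also carry valid and protection bits, and a
       missing entry would raise a memory (page) fault. */
    static uint32_t page_table[] = { 7, 3, 0, 5 };

    uint32_t translate(uint32_t vaddr)
    {
        uint32_t page   = vaddr >> PAGE_BITS;       /* virtual page number */
        uint32_t offset = vaddr & (PAGE_SIZE - 1);  /* offset within page  */
        uint32_t frame  = page_table[page];         /* physical frame      */
        return (frame << PAGE_BITS) | offset;
    }

    int main(void)
    {
        /* Virtual address 0x1234 lies in page 1 at offset 0x234; page 1
           maps to frame 3, so the physical address is 0x3234. */
        printf("0x%x\n", translate(0x1234));
        return 0;
    }

Because only the frame number changes during translation, a program's pages may be scattered anywhere in physical memory, which is precisely what permits relocation.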
Special instructions may be needed to set up and control the mapping procedure and to move information between virtual memory spaces.

An important task in any operating system is the protection of information from either programming errors or malicious use. Protection is accomplished by controlling access to disk files, pages or segments. Again, special instructions or memory address mapping operations must be included in the architectural specification.

Application programmers are optimists. System programmers and operating system designers are pessimists because they must deal with the errors and unusual circumstances created by the optimists! Unanticipated conditions and events are called "exceptions" and the operating system must cleanly handle exceptions without interfering with the other users or, worse, crashing the system. Exceptions cover a broad range of conditions including virtual memory faults, arithmetic errors (e.g., divide by zero), attempted execution of undefined (illegal) instructions, I/O device errors, unauthorized attempts to access protected information and power failures. Exceptions are reported by "traps" or "interrupts" which disturb the normal program flow. Execution is switched to an operating system routine that was provided to handle the exception. The mechanisms for detecting an exception, switching to the operating system, and resuming normal execution must be specified for the architecture.

Section 4 - Architecture and technology.

The discussion in this chapter has centered upon a particular kind of architecture known as the "von Neumann machine." In the late 1940's, von Neumann, Burks and Goldstine first defined what has become the most popular and prevalent kind of computing engine. The von Neumann architecture has several distinguishing features or principles. First, primary memory is organized into a linearly addressable array of fixed length binary words. Both instructions and data are stored in primary memory and are indistinguishable from one another (both data and instructions are binary values.) A special register known as an instruction pointer or program counter is used to select and fetch instructions from memory for execution. Goto instructions are available to conditionally and unconditionally alter the selection of instructions by changing the value of the instruction pointer. For ordinary instructions, the program counter is advanced instruction by instruction in sequential order through memory. The hardware is synchronous and is controlled by a central clock. Finally, all arithmetic is performed in parallel rather than bit-serial fashion using relatively primitive data operations.

Several of the von Neumann principles are self-reinforcing. For example, the sequential selection of instructions and the linear addressing of the memory are consistent. The binary encoding of instructions and data permits the mechanical manipulation of instructions as data, leading to assemblers, compilers and higher level languages. It is probably the self-consistency of the von Neumann engine which has given it beauty and longevity.
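The essential von Neumann principles fit in a few lines of C: a single linear memory holding both instructions and data, and a program counter that steps sequentially unless a goto changes it. The sketch below is illustrative only; the accumulator organization and the opcode encoding are invented.

    #include <stdio.h>

    enum { LOAD, ADD, STORE, HALT };   /* invented opcodes */

    int main(void)
    {
        /* One linear memory holds both the program (words 0..7) and the
           data (words 20..22): instructions and data are just bits. */
        int mem[32] = { LOAD, 20, ADD, 21, STORE, 22, HALT, 0 };
        int acc = 0, pc = 0, running = 1;

        mem[20] = 17; mem[21] = 25;    /* data */

        while (running) {
            int op   = mem[pc++];      /* fetch the opcode, advance the PC */
            int addr = mem[pc++];      /* fetch the operand address        */
            switch (op) {              /* decode and execute               */
            case LOAD:  acc = mem[addr];  break;
            case ADD:   acc += mem[addr]; break;
            case STORE: mem[addr] = acc;  break;
            case HALT:  running = 0;      break;
            }
        }
        printf("mem[22] = %d\n", mem[22]);   /* prints 42 */
        return 0;
    }

A goto would simply assign a new value to pc, and because instructions live in the same memory as data, a program can compute and store instructions -- the property that makes assemblers and compilers possible.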
The von Neumann architecture, however, has certain drawbacks. All instructions and data must move through the single channel between the processor and memory. Due to the synchronous, sequential control of the machine, operations can only be performed one at a time. Therefore, several computations may not proceed concurrently even if the data for those computations is available. Thus, the opportunities for exploiting program parallelism are limited. All data structures (records, linked lists, arrays, etc.) must be mapped onto the linearly addressable store. Certain kinds of data structures like sets and bags have notably inefficient implementations on von Neumann machines. Newer programming styles, especially declarative languages like functional or logic programming, are incompatible with the von Neumann architecture. Even the lowly goto, whose use is considered harmful by structured programmers, is fundamental to the von Neumann style.

Although we will not cover them in depth here, several modifications to the von Neumann architecture have been proposed. Vector machines have instructions that operate on arrays instead of single binary words. By providing many channels between the processor and memory and by "pipelining" the processor, much higher throughput can be obtained. The use of vector instructions reduces the overhead on the processor-memory channel through a reduction in the number of instructions fetched per useful operation performed. Massive parallelism may be employed by connecting tens, hundreds or thousands of processing units together. Thus, a very large number of operations can be performed concurrently by the processors. Operands and results are exchanged by the processors when necessary. The notion of a linear program representation and sequential instruction execution can be eliminated. This is the approach pursued in "dataflow machines" that use data dependency graphs as their program representation. A data dependency graph shows the producer-consumer relationships between program operations. If a sufficient number of physical processing elements are available, many producers can be executing at once.

This chapter is only a brief introduction to computer architecture. As architecture moves away from the von Neumann paradigm, the field poses many exciting engineering challenges for the future.

Copyright (c) 1987-2013 Paul J. Drongowski