Computer Design
A computer aided design and VLSI approach
Paul J. Drongowski

Chapter 7 - Building blocks.

As explained in Chapter 6, good engineering design involves both top-down and bottom-up design. While top-down design permits an appropriate and comprehensible decomposition of system function, bottom-up design is needed to accommodate the real engineering concerns of the implementation.

This chapter describes a set of building blocks for digital system design. We will briefly describe each block and give control information. The physical characteristics of the blocks are summarized in the final section and may be used in the organization level design of the SP.1.

Section 1 - Register storage.

Registers provide small amounts of relatively fast store. A "simple" register stores one word of data. Data is written and read in parallel using the data inputs (X) and outputs (Y), respectively. Normally, the outputs are always driven and the data stored in the register is available without an explicit "read" operation. One or more control lines must be provided for the "load" or "write" operation, however.

In the case of standard 74LS-series logic, the enable signal is usually a gated clock whose leading (or trailing) edge evokes the load operation. Figure 2 shows how a clock edge may be gated (conditionally controlled) using an external AND gate. This function may instead be built into the register, in which case the "write enable" and clock signal(s) must be distributed to the register instead of a gated clock.

            ---
           |   |
    X ---->|   |----> Y
           |   |
            ---
             ^
             |
      Enabling signals

    Figure 1 - Simple register.

                     ---
    Write enable ---|   |
                    | * |----> Gated edge
    Clock ----------|   |
                     ---

               ____      ____
    Clock ____|    |____|    |____
                       _________
    Write ____________|         |_
                         ____
    Gated ______________|    |____
    Edge

    Figure 2 - Gated clock.

Optionally, a simple register may include one or both of the intrinsic operations "preset" and "clear." All bits in the register will become one when preset is evoked.
The register will be set to zero when clear is performed.

A simple register of word length N can be constructed by interconnecting N one bit storage elements (or flip-flops). The N elements share the same clock, enable, clear and preset control signals. Thus, each of the elements will behave in concert with the others.

The process of building multi-bit devices from a single bit design is called "replication." Replication is a powerful mechanism since a single design can be used many times although it is drawn and tested only once. VLSI systems in particular make extensive use of replication because custom circuit layout and verification is an especially time-consuming and expensive task. If a design is largely composed of a few replicated elements, it is said to be "regular."

Simple registers appear frequently in digital design. They may be used to hold intermediate results, general register values, interface data, instructions and processor conditions.

A "counter" is similar to a simple register. However, it can also perform one or both of the intrinsic operations "increment" and "decrement." (Increment and decrement are sometimes called "count up" and "count down.") Increment and decrement add one to and subtract one from the register, respectively, when the clock is asserted. The count value is initialized using the "load" and "clear" operations. In Figure 3, separate control lines are drawn for the four intrinsic operations and the clock.

             -- Clear
            | --- Load
            || ---- Increment
            ||| ----- Decrement
            ||||
            VVVV
            ----
           |    |
    X ---->|    |----> Y
           |    |
            ----
             ^
             |
           Clock

    Figure 3 - Counter.

A counting register may be used as a program counter (instruction pointer), auto increment or decrement address register, word count register for block transfers or push-down stack pointer. Any situation where a value must be increased or decreased by one is a potential application.

A "register file" is shown in Figure 4. A register file is a one-dimensional array of binary words.
An address or index input must be provided to explicitly select a word for writing or reading. The state of the read/write select input specifies which of these operations is to be performed.

             ---- Read/write select
            | --- Address (index)
            | |
            V V
            ----
           |    |
    X ---->|    |----> Y
           |    |
            ----
             ^
             |
           Clock

    Figure 4 - Register file.

One of the most common uses for a register file is to implement the general register set for a machine ISA. If any spare words are available within the file beyond those needed for the general register set, the spares can be used as temporary locations or as any other register in the ISA.

The shift register is a simple register with two additional intrinsic operations, "shift left" and "shift right." When the clock is asserted, each bit will move one position to the right or left. Inputs and outputs are provided at either end of the shift register to handle data entering and leaving the register as the result of a shift. If an arithmetic shift or rotation is required, the inputs must be set to the appropriate "boundary" values. In the case of a right arithmetic shift, for example, the sign bit must be reproduced at the left end of the word.

                    -- Clear
                   | --- Load
                   || ---- Shift right
                   ||| ----- Shift left
                   ||||
                   VVVV
                   ----
    Right in ---->|    |<---- Left in
           X ---->|    |----> Y
    Left out <----|    |----> Right out
                   ----
                    ^
                    |
                  Clock

    Figure 5 - Shift register.

Arithmetic, logical and rotation shifts can be implemented using a shift register. The shifts can be performed "in place" without separately reading, rearranging and writing the contents. A shift register may also be used when multiplication or division by a power of two is required.

Section 2 - Data operators.

Data operators combine and transform digital information as it passes to and from storage. Data operators are "functional" in the mathematical sense. The output of a data operator is not history sensitive and depends solely upon the value of the inputs.
Data operators are usually implemented using combinational logic.

             ---
            |   |----> Z0
    X0 ---->|   |----> Z1
    X1 ---->|2:4|----> Z2
            |   |----> Z3
             ---

    Figure 6 - Two decoders.

The simplest data operator is the decoder. The decoder has N inputs and 2^N outputs. For each input value, only one output will be asserted. Two decoders (two to four and three to eight) are shown in Figure 6. Decoders come in two forms: high true and low true. In the high true form, only one of the outputs will be high while the rest are low. The opposite situation holds for low true decoders, i.e., one line is asserted low while the others remain high.

A decoder can be used to decode control information. For example, the operation code field of an instruction can be decoded into individual control lines, one line for each instruction operation. When reading or writing a register file or memory element, one word must be selected for the operation. It is a decoder that maps the address (index) to the control line that will select the appropriate word.

    Figure 6 - Full adder.

    Figure 7 - Ripple carry adder.

The interpretation of an ISA requires the evaluation of many arithmetic and logical operations. The most common of these operations are:

    R = A + B + Cin     Addition with carry-in
    R = A - B - Bin     Subtraction with borrow
    R = A & B           Bit-wise logical AND
    R = A | B           Bit-wise logical OR
    R = A @ B           Bit-wise exclusive-OR
    R = ~ A             One's complement (logical NOT)
    R = ~ A + 1         Two's complement (negative)

Addition appears so frequently that an adder is regarded as a basic building block. (Early machines exploited the internal logic of the adder to form other operations as well.) Subtraction is easily obtained from addition by taking the one's complement of the right hand argument and forcing the carry-in of the adder to one. Increment by one and decrement by one may be constructed in a similar manner.

    Figure 8 - Lookahead adder.

The design of addition circuits has been extensively studied.
The two most widely recognized implementations are the "ripple carry" and "carry lookahead" adders. An N-bit ripple carry adder consists of N full adder circuits where each full adder produces a one bit sum with carry-out. By interconnecting the carry-out of each stage to the carry-in of the next higher stage, a chain of full adders is formed. Carries will propagate through the chain one stage at a time. Thus, the worst case delay grows with the length of the carry path, i.e., with the word length N.

The lookahead adder speeds the addition process by generating and propagating two special signals (G and P in Figure 8) across a group of individual adders, thereby skipping some stages of the carry chain. Total delay time will depend upon the group size and the number of groups in the generate and propagate chain.

Texas Instruments introduced the first monolithic arithmetic logic unit (ALU), the 74181, in 1970. The 74181 and its successors use carry lookahead logic for speed and compute a broad set of arithmetic operations together with all sixteen logical functions of two variables. Since then, the ALU has become an essential building block for computer design.

    Figure 9 - Arithmetic logic unit.

An ALU block has two parallel data inputs for the left and right operands and one parallel data output for the result. The operation code input selects the arithmetic or logical computation to be performed. External logic sets the carry-in either to a constant value or, for multiple precision arithmetic, to the carry-out of a previous computation. In addition to carry-out, the ALU must produce condition signals for a negative or zero result and arithmetic overflow.

            ---
    A ---->|   |----> A < B
           |   |----> A = B
    B ---->|   |----> A > B
            ---

    Figure 10 - Comparator.

Often a programmer is only interested in comparing the values of two quantities and then making a decision based upon that comparison. This function is performed by the "comparator" block (Figure 10). Testing for zero is even simpler.
The logical NOR of all the bits in the word will be true exactly when all of the bits are zero.

It is sometimes necessary to rearrange the bits within a binary word. Shift left and shift right are rearrangements that move each bit one position to the left or right. Byte swap instructions, which exchange the position of the upper and lower bytes within a word, are another kind of rearrangement. Finally, some implementations of the Fast Fourier Transform (FFT) require a complex interchange of address bits for shuffle-exchange operations. The FFT rearrangement is especially expensive (with respect to runtime) to implement in software.

    Figure 11 - Shift and swap networks.

Luckily, hardware rearrangement is easy to accomplish -- two components are interconnected in accordance with the desired rearrangement. Figure 11 illustrates wiring networks for left shift, right shift and byte swap operations. In the case of the shifts, "boundary" values must be supplied to set incoming bits to the appropriate values. When implementing rotation, for example, the outgoing bits are wrapped around to the incoming positions, forming the rotation.

Section 3 - Programmable logic arrays.

Semiconductor manufacturers provide families of standard building blocks that cover the most common design situations. Thus, one expects to find registers, counters, adders or ALUs in the family. However, an application may demand a combinational function (or data operator) that is so specialized that it is not included in an industry standard family.

The obvious solution is to implement the combinational function using a network of NAND and NOR gates. This is an expensive solution if there is a large number (more than four) of inputs, outputs and terms. A "random" gate implementation will require several circuit packages and area for the interconnecting wires.

To satisfy these specialized needs, vendors began selling programmable logic devices (PLDs).
Internally, a PLD consists of a regular array of gate (and sometimes storage) elements in which the final interconnections have not yet been determined. The desired logic function is obtained by physically altering the device, thereby making the final interconnect. This is called "device programming" since the process is similar to machine level programming.

Two kinds of manufacturing and programming schemes are in common use.

1. All of the gates are formed on the integrated circuit chip with the exception of the final wiring step. The designer determines the configuration of the wires which are then added to the chip. "Mask programmable" devices of this type are suitable for high volume applications as the final manufacturing step is relatively expensive.

2. Electrical connections are established between all of the inputs, outputs and gates. The array is programmed by selectively breaking any unwanted connections just like blowing a fuse. "Fusible link" devices can be programmed in the field (e.g., on a lab bench).

Field programmability comes with a cost, however. Mask programmable devices have a higher gate density than field programmable circuits and can accommodate larger combinational functions. If the designer is careful, the gate network can be partitioned so that more than one distinct logic function can be implemented within a given PLD. Thus, the higher density mask devices can support a larger number of logic functions as well.

Programmable logic devices have several advantages over random gate implementations. First, the entire logic function can be implemented in a smaller number of circuit packages, perhaps in just one package. This saves board and interconnect area. Next, less current is required with a corresponding decrease in power (heat) dissipation. Driving signals off-chip across a circuit board is a power hungry activity. The gates in a PLD are local to one another and do not require the higher drive capability of off-chip circuits.
Third, manufacturers can produce programmable devices in large quantities since a common circuit base can satisfy many customers through programmability. Finally, a design is easily changed by reprogramming the device.

    Figure 12 - Two level programmable logic array.

The most common kind of PLD is the programmable logic array (PLA), which implements the sum of products form of a logic function. A PLA consists of an AND-plane that contains the product terms and an OR-plane that produces the sum of the product terms. The AND-plane is driven by a set of buffering circuits that derive the true and complement form of each input. Several inputs and outputs are permitted. The exact number is determined by the specific part which is selected for the design.

The PLA in Figure 12 implements a full adder. The three inputs are the left and right operand bits and the carry-in. The two outputs are the sum and carry-out. In addition to specialized combinational blocks, PLAs may be used to construct finite state machine controllers and even read only memory (ROM).

Section 4 - Memory.

As the name implies, ROM is a storage element that is written once and is only read thereafter. Those portions of the ISA which are fixed are candidates for ROM implementation. In the PDP-11 series machines, for example, interrupt entries, the input/output device registers and even the condition code register are addressable as memory locations. In the case of the condition code register, the processor must recognize the octal address 0177776 as the location of the condition code register and either change or read the condition codes. The constant 0177776 must be stored somewhere and a ROM is the likely location. Control information such as the microprogram emulating the ISA is usually stored in ROM. As noted in Chapter 2, the availability of inexpensive high capacity ROM led to Complex Instruction Set Computers.
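Both a ROM and the full adder PLA of Figure 12 are fixed mappings from input bits to output bits. As a rough illustration of the AND-plane/OR-plane structure described in Section 3, the following Python sketch models a two level PLA programmed as a full adder. The names AND_PLANE, OR_PLANE, pla_eval and full_adder are hypothetical and chosen only for this example, and the particular product terms are one reasonable choice, not necessarily those drawn in Figure 12.

```python
# Hypothetical software model of a two level PLA (not taken from the book).
# Each product term lists the input values it requires; an input that is
# missing from a term is a "don't care" for that term.
AND_PLANE = [
    {"a": 1, "b": 0, "cin": 0},  # term 0: a AND (NOT b) AND (NOT cin)
    {"a": 0, "b": 1, "cin": 0},  # term 1
    {"a": 0, "b": 0, "cin": 1},  # term 2
    {"a": 1, "b": 1, "cin": 1},  # term 3
    {"a": 1, "b": 1},            # term 4: cin is a don't care
    {"a": 1, "cin": 1},          # term 5: b is a don't care
    {"b": 1, "cin": 1},          # term 6: a is a don't care
]

# The OR-plane names, for each output, the product terms that are summed.
OR_PLANE = {
    "sum": [0, 1, 2, 3],   # true when an odd number of inputs are one
    "cout": [4, 5, 6],     # true when at least two inputs are one
}


def pla_eval(inputs):
    """Evaluate every product term, then OR the selected terms per output."""
    terms = [
        all(inputs[name] == want for name, want in term.items())
        for term in AND_PLANE
    ]
    return {
        out: int(any(terms[i] for i in rows))
        for out, rows in OR_PLANE.items()
    }


def full_adder(a, b, cin):
    """Return (sum, carry_out) for one bit position."""
    out = pla_eval({"a": a, "b": b, "cin": cin})
    return out["sum"], out["cout"]


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                print(a, b, cin, "->", full_adder(a, b, cin))
```

Changing the function means editing only the two tables, which is the software analogue of reprogramming the device.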
            ---- Chip select
           | --- Address
           | |
           V V
           ----
          |    |
          |    |----> Output
          |    |
           ----

    Figure 13 - Read only memory.

A read only memory has an address input and a data output. ROMs may be concatenated horizontally to build wider words. If more memory locations are needed, ROMs may be stacked vertically. The ROM circuits share the same low order address lines. When many ROM packages are connected in an array, high capacity buffering circuits are required to drive all of the address inputs in the array. A chip enable input is provided to assist the stacking process. The high order address bits are first decoded. Then the outputs of the decoder are connected to the chip enable inputs of the ROM circuits that form a complete memory word.

    Figure 14 - ROM implemented by PLA.

Read only memory can be implemented using a programmable logic array (Figure 14). The address is applied to the inputs of the AND-plane where it is fully decoded. Since each product term is unique, only one "row" in the OR-plane will be enabled. By programming the data values into the OR-plane, the ROM data is selected and sent to the output of the PLA.

Note that a PLA is more versatile than a ROM because the AND-plane permits the use of "don't care" terms in addition to the true and complement values of the inputs. Thus, a PLA can ignore a condition or address input. For this reason, a programmable logic array is better suited for the implementation of finite state machine controllers that must respond to or ignore external stimuli.

                ---- Chip select
               | --- Address
               | |
               V V
               ----
              |    |
    Input --->|    |----> Output
              |    |
               ----

    Figure 15 - Read/write memory.

Random access memory (RAM) is both readable and writable. The term "random access" refers to our ability to address any single location in memory at any time. In "sequential access" memory, however, information is only accessible in a serial stream of data.
To obtain a word of information in the middle or at the end, the memory is scanned from the first word and through each succeeding word until the target word is found. A magnetic tape drive is a sequential access device. Serial access is generally much slower than random access since data items must be scanned in sequence on every access, whereas any RAM location is directly accessible.

RAM circuits may be concatenated and stacked to obtain wider words and bigger memories. As in the case of ROM, the common address lines may need buffering and external address decoding will be required for stacked arrays.

Most RAM today is electronic and will lose its data if power is lost. The speed of electronic RAM approaches register speed, and the high density afforded by RAM makes it the best choice for primary memory.

RAM may also be used to construct cache memories and writable control stores. A cache memory speeds up access to primary memory data by maintaining frequently used values within the processor itself, thereby avoiding relatively lengthy processor to memory transactions. A writable control store (WCS), like a ROM control store, holds the ISA microprogram. Since it is writable, however, the microprogram may be changed and programming bugs are easily fixed. The customer may also run a different microprogram, possibly emulating another ISA.

Section 5 - Selection.

Data selectors allow the machine to select one of several different arguments or results. For example, if two different data values must be stored in the same register (at two different times, of course!), the values must be routed to the register through a selector. The value of the control input to the selector will choose one of the two inputs at execution time.

                 ---
    Input0 ---->|   |
    Input1 ---->|   |----> Output
      ...       |   |
    InputN ---->|   |
                 ---
                  ^
                  |
                Select

    Figure 16 - Multiplexer.

              Gate
               |
             -----
               /
    Input ____/   ____ Output

    Figure 17 - Switch.
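The register example at the start of this section can be sketched in software. The Python fragment below is only an illustration under simple assumptions: mux and register_next are hypothetical names, data values are plain integers, and one call to register_next stands for one clock edge.

```python
# Hypothetical behavioral model of a selector feeding a register.

def mux(inputs, select):
    """Multiplexer: the select value picks exactly one of the data inputs."""
    if not 0 <= select < len(inputs):
        raise ValueError("select value has no corresponding input")
    return inputs[select]


def register_next(current, write_enable, select, sources):
    """Value held by the register after one clock edge.

    The register loads the selected source when write_enable is true
    and keeps its current contents otherwise.
    """
    if write_enable:
        return mux(sources, select)
    return current


if __name__ == "__main__":
    alu_result, memory_data = 7, 9          # two values routed to one register
    r = 0
    r = register_next(r, True, 0, [alu_result, memory_data])   # load ALU result
    r = register_next(r, False, 1, [alu_result, memory_data])  # hold
    r = register_next(r, True, 1, [alu_result, memory_data])   # load memory data
    print(r)
```

The select value is control information: it is produced by the controller at execution time, while the data paths into the selector are fixed when the machine is designed.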
Three kinds of selection elements are available: multiplexers, switches, and bi-directional data buffers. A multiplexer works exactly like the selection element described above. It has N data inputs, one selection input and one data output.

A switch element is a voltage (or current) controlled switch. nMOS pass transistors and CMOS transmission gates are two examples of a switch circuit. The path from the input to the output is controlled by the gate input. When the gate input is high, the path is established and the information flows through the switch. Data may be routed by connecting switches into a series/parallel network and sending selection information to the gates.

     Input0       Input1       Input2
       |            |            |
       @-Enable     @-Enable     @-Enable
       |            |            |
    -----------------------------------  Shared bus

    Figure 18 - Bi-directional data buffers.

Figure 18 shows another alternative for selection. In this illustration, three inputs are connected to a shared bus through bi-directional data buffers. An input value is placed onto the bus by asserting the enable input of its bi-directional driver. Since the shared bus cannot be simultaneously driven by two or more senders, at most one of the "Enable" inputs may be asserted at any time.

TTL tri-state buffers are an example of a bi-directional data buffer. The output of a tri-state buffer is turned on (driven) when deliberately enabled. When the output is turned off, the driver enters a high impedance state which permits one of the other drivers on the bus to send data.

The advantages and disadvantages of the three techniques break down in the following way. Multiplexers are available in every circuit technology (logic family). They are active circuits that consume power and force input wires to be routed to a central place. Switch elements consume very little or zero power. Due to signal losses through the switch, restoring logic (amplification) is required. They are not available in all circuit technologies.
Bi-directional data buffers are excellent for distributed busses. Care must be taken, however, to guarantee a single sender at any time. Long busses (e.g., across a chip or backplane) must be driven hard for high speed with a corresponding increase in current consumption and heat dissipation. Clearly, the choice of one scheme over another is an engineering decision affected by the overall design context.

Section 6 - External interface.

External communication often demands special circuits and provisions for compatibility. For example, terminals, modems and computers intercommunicate using the RS-232C electrical standard. RS-232C specifies the voltage levels, connector pins and signal interpretations that permit any device adhering to the standard to correctly exchange information. When an interface becomes a standard, manufacturers will produce components that directly support the standard, making the design task easier. RS-232C receivers and drivers and asynchronous line interfaces, for example, are readily available.

VLSI technology has special requirements for external communication. Signals must eventually be driven off-chip to other subsystems and, of course, must be received as well. The microscopic features of an integrated circuit chip must first be connected to the macroscopic world of printed circuit boards and wires. This function is performed by the package, which surrounds the chip and protects it from mechanical damage. Thin gold wires are bonded to the chip and the package pins, forming the electrical bridge.

    Figure 19 - External interface.

To support off-chip connections, large metal regions called "bonding pads" must be allocated on the chip (Figure 19). The pads are positioned around the edge of the chip, permitting easy access for wire bonding. The pads are physically large and consume a substantial portion of total chip area. Further, the capacitance of off-chip wires is much larger than that of on-chip wires.
Big drive transistors must be designed into the interface to drive external connections (charge and discharge the wires) without a performance penalty. As much as 40 percent of total circuit power may be dissipated by the output drivers. External interface circuits are not just big -- they may be power hungry, too. Therefore, there is considerable pressure to minimize off-chip connections and keep the pin count low.

Section 7 - Physical characteristics.

All physical circuits occupy space, consume electrical current and generate heat. They also take some time to switch on and off and to communicate; this time is the delay. The design projects in this book use a set of circuit primitives that are similar to "standard cell" integrated circuit technology. Area, supply current, power dissipation and worst case delay are summarized for each building block in the table below.

    Standard cell            Area (blocks)  Supply   Power    Delay
    ----------------------   -------------  -------  -------  --------
    8 x 8 register file      4 x 5          7 mA     35 mW    30 nsec
    16 x 16 register file    8 x 10         25 mA    125 mW   35 nsec
    8-bit register           4 x 1          0.8 mA   3.7 mW   20 nsec
    16-bit register          8 x 1          1.5 mA   7.5 mW   20 nsec
    8-bit counter            5 x 1          1.0 mA   5 mW     20 nsec
    16-bit counter           10 x 1         2.0 mA   10 mW    25 nsec
    Control store            12 x 6         50 mA    250 mW   35 nsec
    ALU                      8 x 4          20 mA    100 mW   30 nsec
    16-bit 2:1 mux           4 x 1          0.8 mA   4 mW     10 nsec
    16-bit 4:1 mux           4 x 1          0.4 mA   2 mW     10 nsec
    8-bit external input     8 x 2          0 mA     0 mW     0 nsec
    16-bit external input    16 x 2         0 mA     0 mW     0 nsec
    8-bit external output    8 x 2          20 mA    100 mW   100 nsec
    16-bit external output   16 x 2         40 mA    200 mW   100 nsec
    8-bit ext. tristate      8 x 2          25 mA    110 mW   100 nsec
    16-bit ext. tristate     16 x 2         45 mA    225 mW   100 nsec

Copyright (c) 1987-2013 Paul J. Drongowski