Computer Design: A computer aided design and VLSI approach
Paul J. Drongowski

Chapter 10 - Style.

Chapters 8 and 9 on basic datapath and control design give us the tools to examine stylistic issues in processor design. A designer does not look upon the system as a "random" collection of components and events. Rather, she or he organizes the system structure in accord with a consistent theme that enhances the comprehensibility of the design. Humans naturally try to produce order from disorder, and that is the function of style in computer design. This chapter presents three kinds of stylistic choices in datapath design and three in control design. The key trade-offs in adopting a particular style are:

* The number of datapath operations which can be performed in parallel,

* The number of control events needed to execute an aggregate system operation,

* The number of components needed in the datapath or control design (with direct effect on current, space and cooling requirements), and

* Interconnection complexity.

Of course, one does not have to follow a style rigidly. If execution speed can be increased or component count reduced by choosing a "hybrid" of two or more styles or by specializing the design to exploit a feature of the architecture, then by all means the designer should take advantage of the situation!

Section 1 - Busses.

The naive synthesis procedure given in Chapter 9 will usually yield a "chaotic" datapath since its structure is driven solely by the data dependencies in the ISA specification. The advantage of this scheme, however, is that the datapath can be highly specialized for the ISA. We call this the "zero bus" style.

The zero bus style is compatible with an event scheduling strategy that minimizes the number of control steps in order to increase execution speed. All transfers that can be performed in parallel (subject to flow dependencies between the computation and use of a variable) are set to execute in one control event. Flow dependencies, where the value of a variable must be computed and stored in one event in order to be used in a subsequent event, govern the assignment of transfers across sequential events. The zero bus style will result in the fastest possible schedule (the shortest sequence of control steps), but it is expensive in component count and interconnection complexity. All expressions must be fully evaluated in one control step, so little component sharing is possible. A component can serve two or more uses only if each use can be scheduled independently across two or more control steps.
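To make this scheduling strategy concrete, here is a minimal sketch of "as soon as possible" event scheduling in Python. The transfer list and variable names are hypothetical, and the sketch deliberately ignores resource limits -- the zero bus assumption that every transfer has its own hardware.

    # Minimal sketch of "as soon as possible" event scheduling.  Each
    # transfer names the variable it writes and the variables it reads;
    # a transfer may not execute before the event following the one that
    # produces each of its inputs.  Resource limits are ignored, as in
    # the zero bus style, and all transfer names are hypothetical.

    def schedule(transfers):
        """transfers: list of (destination, sources) pairs in program
        order.  Returns a list of control events, each a list of the
        destinations written in that event."""
        ready = {}     # control event in which each value becomes usable
        events = []    # events[i] = destinations written in event i
        for dest, sources in transfers:
            earliest = max((ready.get(s, 0) for s in sources), default=0)
            while len(events) <= earliest:
                events.append([])
            events[earliest].append(dest)
            ready[dest] = earliest + 1   # usable in the following event
        return events

    # T1 <- A + B and T2 <- C + D are independent and share event 0;
    # Z <- T1 + T2 must wait for both partial sums.
    print(schedule([("T1", ["A", "B"]),
                    ("T2", ["C", "D"]),
                    ("Z",  ["T1", "T2"])]))
    # [['T1', 'T2'], ['Z']]

The two independent partial sums share control event 0, while the final transfer waits one event for its operands. A production scheduler would also honor resource constraints and anti-dependencies.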
It may be necessary to share expensive (large or power-hungry) resources such as a multiplier to meet spatial, current or cooling constraints. In that case, the common resource must be "multiplexed" in time as only one operation may be active at any time. The operations using the resource, therefore, must be serialized into sequential control events. This is a classic space-time trade-off.

To implement the system as a VLSI circuit or printed circuit board, the components must be positioned in a two-dimensional (2-D) plane and the wires must be routed according to the schematic. Clearly, the number of wires and distinct point-to-point connections will affect the amount of board or chip area needed for interconnect. (On-chip wiring, for example, may require as much as 80 percent of total chip area.) Further, wire routing in VLSI circuits is expensive. If done manually, routing is a time-consuming and error-prone task. Although routing programs are faster and more accurate, they still consume large amounts of memory and computer cycles at a very real cost to the manufacturer. An automatic router may be able to route only 95 to 99 percent of the system, leaving the remaining 1 to 5 percent for manual layout.

To reduce routing complexity, designers often introduce shared "busses" into their designs. A bus also makes the system expandable if it is brought out to a readily accessible connector. Figure 1 depicts a typical two bus datapath.

     Write                     Read
       |         -----           |
       |------->|  X  |--------->|
       |         -----           |
       |         -----           |
       |------->|  Y  |--------->|
       |         -----           |
       |         -----           |
       |------->|  Z  |--------->|
       |         -----           |
       |-----                    |
       |     |                   |
       |     V                   |
       |   -----                 |
       |  |  T  |                |
       |   -----                 |
       |     |                   |
       |     V                   V
       |   -----------------------
       |   \          ALU        /
       |    ---------------------
       |              |
       ---<------------

     Figure 1 - Example two bus datapath.

The bus on the right is used for reading operands from the storage elements X, Y and Z. The read bus is routed to the right hand input of the arithmetic logic unit. ALU results are written to X, Y and Z from the bus on the left. Note that the temporary register T is needed to hold the left hand operand to the ALU. Two control steps will be required to perform the addition "Z <- X + Y" as shown in Figure 2.

        -------------
       | Move X to T |
       |-------------|
       |   T <- X    |
        -------------
              |
              V
        -------------
       | Add & store |
       |-------------|
       | Z <- T + Y  |
        -------------

     Figure 2 - Two bus sum "Z <- X + Y."

Like all shared busses, the read and write busses may each have only one transmitting component at any time to prevent data collisions. Bi-directional buffers, switches or multiplexers may be used to implement the read bus. All data transfers must be serialized across the read bus. This disadvantage of the two bus approach was noted above -- dyadic data operations require two control steps.

The three bus approach alleviates this problem at the cost of a third bus. An operand may be placed onto either read bus and sent to either input of the ALU (Figure 3.) If dyadic operations are used infrequently, then the cost of the third bus may be too high and the two bus approach would be preferred. Dyadic operations (add and multiply) are critical to signal processing applications, however, and there the cost of the third bus may be justified solely on the need for speed.

     Write                  Read  Read
       |         -----        |     |
       |------->|  X  |------>|---->|
       |         -----        |     |
       |         -----        |     |
       |------->|  Y  |------>|---->|
       |         -----        |     |
       |         -----        |     |
       |------->|  Z  |------>|---->|
       |         -----        |     |
       |                      |     |
       |                      V     V
       |                 ------------
       |                 \   ALU    /
       |                  ----------
       |                      |
       ---<--------------------

     Figure 3 - Example three bus datapath.

J-K master-slave flip-flops will be required for registers that must be read and written in the same clock cycle (e.g., to perform transfers such as "X <- X + Y.") Why? As we will see later, the same capability can be gained through the proper use of a two-phase non-overlapping clock.
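The serialization imposed by the shared read bus can be seen in a small behavioral model. The Python sketch below is illustrative only -- the register contents are invented and the ALU is reduced to a Python function -- but it mimics Figure 1: each control step moves one value over the read bus, through the ALU, and onto the write bus.

    # Behavioral sketch of the two bus datapath of Figure 1.  One
    # register drives the shared read bus per control step; the ALU
    # result drives the write bus, from which one register is loaded.

    regs = {"X": 5, "Y": 7, "Z": 0, "T": 0}

    def control_step(read, alu_op, write):
        """One control event: 'read' drives the read bus, the ALU
        combines T (left input) with the bus (right input), and the
        result is written from the write bus into register 'write'."""
        bus = regs[read]                    # single driver on the read bus
        regs[write] = alu_op(regs["T"], bus)

    # Z <- X + Y takes two control events, as in Figure 2:
    control_step("X", lambda t, b: b, "T")      # event 1: T <- X (pass B)
    control_step("Y", lambda t, b: t + b, "Z")  # event 2: Z <- T + Y
    print(regs["Z"])                            # prints 12

In the three bus datapath of Figure 3 the staging step through T disappears, which is precisely the speed argument made above.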
Section 2 - Storage.

The naive register assignment procedure in Chapter 9 calls for a separate register or register file for each ISA variable. The designer could, however, assign all of the ISA variables to a single register file. This extreme assignment scheme would reduce the complexity of interconnection and, in the case of a two bus datapath, would exploit the addressing mechanism of the register file to select read bus operands. All transfers in or out of the register file will be serialized, however.

The opportunity to assign several variables to a single file usually arises when RAM locations are left over after the ISA general registers have been allocated to the register file. Thus, the designer is getting some registers "for free" and wants to take advantage of the situation. The decision whether or not to assign a given variable to the register file should be based upon the frequency of use of the variable. Those variables which see only occasional use should definitely be allocated to the register file. High traffic variables such as those needed for memory address computation or instruction fetch and decode are best allocated to a separate register block. The use of a register file or RAM also precludes the exploitation of intrinsic operations.

      Address   Address
         A         B
         |         |
         V         V
        ------------
       |            |----> Output A
  ---->|            |
       |            |----> Output B
        ------------
              |
             W/R

     Figure 4 - Dual port RAM.

The disadvantages of a standard, one word at a time register file can be removed if a dual port RAM is used instead. A dual port RAM permits two memory words to be read simultaneously. Two independent address and data output ports are required. A dual port RAM can be used to good advantage in a register to register architecture. Simultaneous access to two general registers means that the interpretation of most register to register instructions will require at most one control step.

When a dual port RAM is not available as an off the shelf building block, its operation can be simulated using two standard RAM's (Figure 5.) A data item is written to each RAM in tandem, keeping the contents of the two RAM's consistent.

                Address A
                    |
                    V
                  -------
    Write ------>|  RAM  |----> Output A
    data    |     -------
            |   Address B
            |       |
            |       V
            |     -------
            ---->|  RAM  |----> Output B
                  -------

     Figure 5 - Simulated dual port RAM.
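The tandem write scheme is straightforward to express as a behavioral model. The Python class below is a sketch only (the memory depth and register numbers are invented); it keeps two ordinary single port memories consistent by writing both on every store, so the two read ports can be addressed independently.

    # Behavioral model of the simulated dual port RAM of Figure 5: two
    # ordinary single port RAM's hold identical contents.  Every write
    # goes to both copies, so port A and port B can read different
    # words at the same time.

    class SimulatedDualPortRAM:
        def __init__(self, words):
            self.ram_a = [0] * words    # copy read through port A
            self.ram_b = [0] * words    # copy read through port B

        def write(self, address, value):
            # Tandem write keeps the two copies consistent.
            self.ram_a[address] = value
            self.ram_b[address] = value

        def read(self, address_a, address_b):
            # Two independent read addresses, as on a true dual port RAM.
            return self.ram_a[address_a], self.ram_b[address_b]

    # One control step of a register to register add, "R3 <- R1 + R2":
    regs = SimulatedDualPortRAM(16)
    regs.write(1, 10)                 # R1 <- 10
    regs.write(2, 32)                 # R2 <- 32
    a, b = regs.read(1, 2)            # fetch both operands at once
    regs.write(3, a + b)              # R3 <- 42

The cost of the scheme is the duplicated storage; the benefit is the single-step operand fetch described above.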
Section 3 - Operators.

Component count is reduced by replacing two or more single function data operators by a more "universal" component that covers all of the needed functions. The aforementioned restrictions on datapath parallelism will apply, as functions that were previously performed in parallel are serialized through the new common component. Intrinsic operations may be exploited at the cost of increased control complexity and marginally higher space and power.

Up to this point we have freely traded components and have assumed that component delay times are equal. Data operators have a much wider range of performance. Delay times are given in Table 1 for simple arithmetic and logical operations for both specialized and ALU implementations (on sixteen bit words.) Simple logical functions take longer through the ALU, but the carry lookahead logic of the ALU speeds addition considerably. The ALU dissipates more power in all cases. Depending upon the characteristics of the building blocks, therefore, the execution time of each control step may actually be increased by replacing simple data operators with a more general component. An increase or decrease will affect the execution time of all control steps in a synchronous system.

            Special function block  |  74LS181/182
      ---------------------------------------------------
      Not        9.5 ns    32 mW    |   22.0 ns   400 mW
      And       12.0 ns    68 mW    |   22.0 ns
      Or        12.0 ns    80 mW    |   22.0 ns
      Add      165.0 ns   300 mW    |   44.0 ns

     Table 1 - Special vs. general data operators.

Section 4 - Control style.

Two different styles are available for the implementation of the controller. In the "hard-wired" style, a conventional sequential machine is designed using either the state table, delay element or sequence counter method. The "microprogrammed" style is a variant of the state table method. The microprogram (state table) is stored in a memory element from which microinstructions (control steps) are selected for execution. Control implementation will be discussed in a later chapter. However, the comparative advantages and disadvantages can be addressed here.

Microprogramming is a popular approach because once the datapath design is fixed, the machine is programmed to emulate the ISA -- a task which is perceived to be easy. Bugs are easily corrected by changing the microprogram and distributing a new version of the binary microcode in ROM or on floppy disk (for machines with a writable control store.) Depending upon the implementation method used for hard-wired control, corrections will range from easy (state table) to difficult (sequence counter.)

There is a tendency when designing a microprogrammed engine to bloat the size of the microstore. Horizontally microprogrammed engines have a low degree of microinstruction encoding, so microinstructions are long (e.g., greater than 48 bits). Designers are tempted to add new "features" at the cost of an increase in microprogram length. Horizontal microwords plus "featuritis" is a bad mix, as the 2-D control store grows rapidly. Simple machines such as Reduced Instruction Set Computers are a reaction to the excesses of horizontal microprogramming. RISC's rely upon small grain functions that can be executed in a single control step with minimal macroinstruction decoding. Extensive microprogramming is unnecessary. Microprogramming is still a viable implementation technique if designers can exercise self-restraint and limit the ISA and code size.

Section 5 - Pipelining.

As C or Pascal programmers, we are used to thinking about the sequential execution of programs. The discussion on datapath parallelism demonstrates that hardware designers have the option to perform several operations concurrently if sufficient functionally independent resources are available. This kind of concurrency is called "horizontal" or "spatial" concurrency.

If we observe that two operations A and B are always performed in series:

     -------------------------
     | A | B | A | B | A | B | ...
     -------------------------

then the operations A and B can be overlapped in time:

     ---------
     | A | B |
     ---------
         ---------
         | A | B |
         ---------
             ---------
             | A | B |
             ---------
                 ...

thereby shortening the overall execution time. This kind of concurrency is called "pipelined" or "vertical" concurrency. The term "pipeline" derives from a conceptual pipe of processing elements:

      ---       ---
     | A |---->| B |
      ---       ---

in which a result produced by A is processed by B while A is producing the next result. Although A and B form a "two stage" pipeline, the idea can be extended to an N-stage pipeline as long as there are enough independent tasks and processors available.

Instruction lookahead is a pipelining technique which prefetches an instruction while the current instruction completes execution. The execution sequence for three instructions in the two stage pipeline:

      ---------       ----------
     | Fetcher |---->| Executer |
      ---------       ----------

is given in Figure 6.

    -------------------
 1: | Fetch | Execute |
    -------------------
            -------------------
         2: | Fetch | Execute |
            -------------------
                    -------------------
                 3: | Fetch | Execute |
                    -------------------

     Figure 6 - Instruction lookahead.
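The benefit of the overlap is easy to quantify with a back of the envelope model. The Python sketch below assumes equal stage times and an uninterrupted instruction stream -- idealizations that the next paragraphs revisit.

    # Rough timing model for two stage instruction lookahead.  Assumes
    # every fetch and every execute take one cycle and that the stream
    # is never disturbed by branches -- both idealizations.

    def serial_time(n, fetch=1, execute=1):
        # Without lookahead each instruction runs fetch then execute.
        return n * (fetch + execute)

    def pipelined_time(n, fetch=1, execute=1):
        # The first instruction fills the pipe; afterward one
        # instruction completes per stage time (the slower stage).
        stage = max(fetch, execute)
        return fetch + execute + (n - 1) * stage

    for n in (1, 3, 100):
        print(n, serial_time(n), pipelined_time(n))
    # n=3 matches Figure 6: six time slots serial, four overlapped.

For a long instruction stream the two stage pipe approaches a factor of two speed-up (200 cycles serial versus 101 cycles overlapped for 100 instructions).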
The Berkeley RISC machine has a three stage pipeline (Figure 7.) Instructions pass through three phases: fetch and decode, result computation and result store.

    ---------------------------
 1: | Fetch | Compute | Store |
    ---------------------------
            ---------------------------
         2: | Fetch | Compute | Store |
            ---------------------------
                    ---------------------------
                 3: | Fetch | Compute | Store |
                    ---------------------------

     Figure 7 - Berkeley RISC pipeline.

Pipelining assumes that the stages will operate independently and that the stream of data will be uninterrupted. Unfortunately, this assumption is not true in practice. Branch instructions and interrupts disturb the regular flow of instructions through the pipeline, and results produced by an instruction are often needed by the very next instruction.

The harshest solution to the interdependence problems above is to empty or "flush" the pipeline and restart execution. If the effort to set up the pipeline is substantial, as in the case of a long pipeline, this solution is expensive. One can also anticipate pathological cases where the pipeline is longer than a short loop.

One approach to the problem of branching and interrupts is the "delayed branch." In this scheme, the normal flow of the pipeline is not immediately disturbed. The prefetched instruction is executed while an instruction is fetched from the (new) target location. Thus, the effect of the interrupt or branch is delayed for one or more execution cycles. Berkeley employed the delayed branch technique in their RISC.

In any parallel system, results must be produced before they are consumed. A close examination of Figure 7 will reveal a potential problem between the store and compute stages of the Berkeley pipeline. During the compute phase, operand values are read from the general register set (which was implemented with a dual port array.) If the store and compute phases are overlapped, however, a needed result may not have been written into the general registers before it is read during the compute phase of the next instruction. The Berkeley team solved this problem by keeping the phase two result in a separate data buffer and by detecting the interdependence between the two instructions. Rather than reading the wrong value from the register array, the operand is retrieved from the buffer instead.

Section 6 - Clocking.

One major drawback of synchronous design with a single phase clock is that the minimum clock period is determined by the speed of the slowest delay path through the design. Results must be computed fast enough to guarantee the minimum set-up time for reliable data storage. A programmable clock alleviates this problem by giving the designer a choice of clock periods. While constructing the microcode (or hard-wired controller), the engineer can select the minimum clock period needed to perform the desired data operations.

In order to handle the specialized data operators in the first column of Table 1, the designer might use a clock with two selectable periods of 45 and 195 nanoseconds (assuming a combined 30 nanosecond set-up and hold time.) The two event sequence of Figure 8 requires 390 and 240 nanoseconds to execute under fixed and programmable speed clocks, respectively.

                          Fixed    Programmable
        --------------
       |  Short op    |   195 ns      45 ns
       |--------------|
       |  Z <- X & Y  |
        --------------
              |
              V
        --------------
       |  Long op     |   195 ns     195 ns
       |--------------|
       |  Z <- X + Y  |
        --------------

     Figure 8 - Graph under fixed and programmable clocks.
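The arithmetic behind Figure 8 can be checked with a few lines of Python. The period menu and the rule for rounding a delay up to the next available period are assumptions consistent with the text; the operator delays come from the first column of Table 1.

    # Arithmetic behind Figure 8.  Each control step needs its operator
    # delay (special function blocks, Table 1) plus a combined 30 ns
    # set-up and hold time.  The period menu and rounding rule below
    # are assumptions consistent with the text.

    SETUP_HOLD = 30                       # ns
    PERIODS = (45, 195)                   # selectable clock periods, ns

    def period_for(delay):
        # Shortest available period covering delay plus set-up/hold.
        return min(p for p in PERIODS if p >= delay + SETUP_HOLD)

    steps = [12.0, 165.0]                 # And (short op), Add (long op)
    chosen = [period_for(d) for d in steps]

    fixed = len(steps) * max(PERIODS)     # fixed clock always runs at 195
    programmable = sum(chosen)            # 45 + 195
    print(fixed, programmable)            # 390 240

Every control step has some overhead associated with its execution. A non-pipelined microprogrammed processor, for example, must fetch a microinstruction before the "real" work is begun. The fetch time is overhead.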
By breaking the clock period into phases and assigning different operations to each phase, we can accomplish more work per control step. Figure 9 contains an example of a three phase clock. (We will discuss the idea of a two-phase non-overlapping clock in a later chapter. That idea is similar, but has some subtle differences.) The clock period is broken into three phases and operations are assigned to each phase. Pulses are generated at different times during the clock period: operations that complete in a short time are controlled by Phase 1, while the slowest are enabled by Phase 3.

                  ____
     Phase 1 ____|    |_____________________________

                           ____
     Phase 2 _____________|    |____________________

                                    ____
     Phase 3 ______________________|    |___________

     Figure 9 - 3-phase clock example.
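As a final sketch, the Python fragment below illustrates one plausible phase assignment rule; the phase times and operation delays are invented for the example. Each operation is clocked by the earliest phase whose pulse arrives after the operation's result has settled.

    # Assigning operations to the phases of Figure 9.  Each phase pulse
    # occurs at a fixed time within the control step; an operation is
    # clocked by the earliest pulse that arrives after its result has
    # settled.  The phase times and delays below are invented.

    PHASE_TIME = {1: 60, 2: 130, 3: 200}   # ns from the start of the step

    def phase_for(delay):
        """Earliest phase whose pulse follows the operation settling."""
        return min(p for p, t in PHASE_TIME.items() if t >= delay)

    for op, delay in [("register move", 20),
                      ("logical OR", 90),
                      ("addition", 180)]:
        print(op, "-> phase", phase_for(delay))
    # register move -> phase 1, logical OR -> phase 2, addition -> phase 3

Copyright (c) 1987-2013 Paul J. Drongowski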