Use of the TI TMS320C30 Chip in the Design of a Programmable HDTV Video Decoder
By John Wiseman (Originally presented at the International Conference on Signal Processing Applications and Technology - Santa Clara, CA - September, 1993)
Abstract
This paper describes the architecture and hardware design philosophy behind the development of a set of programmable video decoder boards utilizing the Texas Instruments TMS320C30 DSP chip. These boards are part of a programmable HDTV decoder system designed and built by Panasonic Advanced Television & Video Laboratories of Burlington, NJ as part of an ongoing HDTV R&D program. The paper concentrates mainly on the Dequantization (DQ) Board, as it is completed and fully functional in a partial system. Mention is made of the Variable Length Decoder (VLD) Board currently being designed with two C30 chips per board.
System Description
The ATVL HDTV decoder system has a video clock of 74.25 MHz. To allow a reasonable hardware implementation, the video is divided into four equally sized vertical strips. In this manner, four mostly independent hardware sections can process four video data streams running in parallel at approximately 18.6 MHz. It should be noted that with this timing, one section alone could support decoding of NTSC video signals. Each section is fed a demultiplexed video stream by a Deformat/Router Board, which is in turn fed by a Bitstream Computer Interface Board. The interface board, combined with a PC on the input, simulates a digital receiver and sends a raw digital bitstream to the decoder at rates up to 28 Mb/sec. Decompressed video from the four sections is recombined into a single composite signal by the Section-To-Raster Converter Board, and then sent to the D/A Board before being displayed on an HDTV monitor.
When this project was started, an FCC standard for HDTV broadcasting in the U.S. had not been finalized. As such, an expanded MPEG I syntax was developed, one that would allow for extensions such as frame/field modes, multiple video strips, etc. This hybrid syntax is followed in the design of the system described here.
Dequantization Board Description
DQ Board Algorithm
The algorithm that the DQ Board implements is taken directly from the MPEG I video specification [1]. For intra macroblocks, specifically luminance blocks, the algorithm is described by:
for (m=0; m<8; m++) {
    for (n=0; n<8; n++) {
        i=scan[m][n];
        dct_recon[m][n]=(2*dct_zz[i]*quantizer_scale*intra_quant[m][n])/16;
        if ((dct_recon[m][n]&1)==0)
            dct_recon[m][n]=dct_recon[m][n] - sign(dct_recon[m][n]);
        if (dct_recon[m][n] > 2047)
            dct_recon[m][n]=2047;
        if (dct_recon[m][n] < -2048)
            dct_recon[m][n]= -2048;
    }
}
dct_recon[0][0]=dct_zz[0]*8;
if ((macroblock_address - past_intra_address > 1))
    dct_recon[0][0]=128*8+dct_recon[0][0];
else
    dct_recon[0][0]=dct_dc_y_past + dct_recon[0][0];
dct_dc_y_past=dct_recon[0][0];
For the following luminance blocks in the macroblock, the main processing remains the same, but the DC differential algorithm is as follows:
dct_recon[0][0] = dct_dc_y_past + dct_zz[0]*8;
dct_dc_y_past = dct_recon[0][0];
Cb and Cr chrominance blocks are handled in the same manner, with DC differential values carried over macroblock boundaries to respective Cb and Cr blocks independently.
Processing of non-intra macroblocks is quite similar to that of intra macroblocks, with two differences. First, there is no DC differential calculation. Second, since the non-intra macroblocks are quantized with a dead-zone quantizer (as opposed to intra macroblocks being quantized with a uniform quantizer), the line:
dct_recon[m][n]=(2*dct_zz[i]*quantizer_scale*intra_quant[m][n])/16;
is modified to:
dct_recon[m][n]=((2*dct_zz[i] + sign(dct_zz[i]))*quantizer_scale*non_intra_quant[m][n])/16;
After the reconstructed DCT coefficients are found in the above manner, an inverse DCT operation is performed. The pixel domain data is then multiplexed at 37 MHz for transmission to the Motion Compensation Board for further processing.
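The reconstruction rules above can be collected into a small C sketch. This is only an illustration of the arithmetic, not the board's implementation: the function names `sign_of`, `dequant_ac`, and `dequant_intra_dc` are hypothetical, and the check that quantizer_scale lies in 1..31 assumes the 5-bit MPEG I field.

```c
#include <assert.h>

/* sign() as used in the reconstruction rules: returns -1, 0 or +1. */
static int sign_of(int x) { return (x > 0) - (x < 0); }

/*
 * Reconstruct one AC coefficient (hypothetical helper).
 *   level  : quantized coefficient dct_zz[i]
 *   qscale : quantizer_scale
 *   q      : intra_quant[m][n] or non_intra_quant[m][n]
 *   intra  : nonzero for intra macroblocks
 */
static int dequant_ac(int level, int qscale, int q, int intra)
{
    assert(qscale >= 1 && qscale <= 31);   /* assumed 5-bit range */

    /* Uniform quantizer for intra, dead-zone quantizer for non-intra. */
    int twice = intra ? 2 * level : 2 * level + sign_of(level);
    int recon = (twice * qscale * q) / 16; /* C division truncates toward zero */

    if ((recon & 1) == 0)                  /* even values move one step toward zero */
        recon -= sign_of(recon);           /* zero stays zero */
    if (recon > 2047)  recon = 2047;
    if (recon < -2048) recon = -2048;
    return recon;
}

/* Intra DC term: add the differential (times 8) to the running predictor. */
static int dequant_intra_dc(int dc_diff, int *dc_past)
{
    *dc_past += dc_diff * 8;
    return *dc_past;
}
```

For example, a level of 1 with quantizer_scale 2 and a matrix entry of 16 gives (2*1*2*16)/16 = 4, which the even-odd rule moves to 3.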
Partitioning of Functionality
To further increase the time available to process video information, each video section is subdivided into two separate channels, with the even blocks processed in the B0 channel and the odd blocks processed in the B1 channel (Figures 1, 2). These two streams of quantized DCT coefficients are sent to the DQ Board from the VLD Board via a pair of high-speed FIFOs. These FIFOs allow a somewhat asynchronous transfer of coefficients from the VLD, with input timing designed to support up to a 30 MHz coefficient input. Coefficients are read out of the FIFOs in the manner shown in Figure 2, timed by a timing generator located in one of the output Xilinx FPGA chips. Data registers are provided at the FIFO outputs for timing purposes. Within a relative macroblock time, macroblock side information is transmitted from the VLD Board through a third FIFO. This information currently consists of several 9-bit code words, describing macroblock type, whether or not a new quantizer scale value is to be used, and synchronization information. The fourth FIFO receives a new quantization matrix (if sent, at the sequence level), also transmitted from the VLD Board.
The scan[m][n] function is handled by the Coefficient Reordering Buffer Memory, immediately following the FIFO data latches. This is physically a high-speed dual-port RAM. Reordering of the input data coefficients into raster scan order (this is the default; the order may be arbitrary) is implemented by an address translation of the input address, via a PROM look-up table. The 2Kx8 PROM allows up to 32 different inverse scanning patterns to be implemented, each selectable via a 5-bit code in an external processor register.
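The PROM sizing can be illustrated with a short C sketch: 2048 entries divided into 64-entry patterns gives the 32 patterns selected by the 5-bit code. The bit layout (pattern code in the upper five address bits, coefficient count in the lower six) and the names `prom_address`, `translate`, and `load_zigzag` are assumptions made for illustration; the paper does not give the actual PROM wiring.

```c
#include <assert.h>
#include <stdint.h>

#define PROM_SIZE     2048                          /* 2K x 8 PROM */
#define BLOCK_COEFFS  64                            /* one 8x8 block */
#define NUM_PATTERNS  (PROM_SIZE / BLOCK_COEFFS)    /* = 32 */

/* Assumed layout: 5-bit pattern select above a 6-bit coefficient count. */
static unsigned prom_address(unsigned pattern, unsigned index)
{
    assert(pattern < NUM_PATTERNS && index < BLOCK_COEFFS);
    return (pattern << 6) | index;
}

/* Translate the arrival index of a coefficient into its RAM write address. */
static uint8_t translate(const uint8_t prom[PROM_SIZE],
                         unsigned pattern, unsigned index)
{
    return prom[prom_address(pattern, index)];
}

/* Fill one pattern slot with the conventional 8x8 zig-zag order, so that
 * coefficients arriving in zig-zag order land at raster positions. */
static void load_zigzag(uint8_t prom[PROM_SIZE], unsigned pattern)
{
    unsigned r = 0, c = 0;
    for (unsigned i = 0; i < BLOCK_COEFFS; i++) {
        prom[prom_address(pattern, i)] = (uint8_t)(r * 8 + c);
        if ((r + c) % 2 == 0) {          /* travelling up and to the right */
            if (c == 7)      r++;
            else if (r == 0) c++;
            else { r--; c++; }
        } else {                         /* travelling down and to the left */
            if (r == 7)      c++;
            else if (c == 0) r++;
            else { r++; c--; }
        }
    }
}
```

Any other scan order could be loaded into another slot in the same way, which is the flexibility the look-up table approach buys.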
The functions (2*dct_zz[i]) for intra macroblocks, and (2*dct_zz[i] + sign(dct_zz[i])) for non-intra macroblocks are implemented via a look-up table in the Data Format PROMs. The appropriate macroblock function is selected by a bit in an external processor register, written to once every macroblock cycle. Provision is made during intra processing to disable this function during the first coefficient access (DC coefficient) to allow proper calculation and subsequent further processing of the DC differential values. This is done by routing a DC address marker from the timing generator to the look-up table to generate a pass-through condition.
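A minimal C sketch of such a look-up table follows. The input range of -255 to 255 is an assumption based on the MPEG I coefficient range, and the names `build_format_lut` and `format_lookup` are hypothetical; the actual PROM organization is not described in the paper.

```c
#include <assert.h>
#include <stdint.h>

#define LEVEL_MIN (-255)                 /* assumed coefficient range */
#define LEVEL_MAX 255
#define LUT_SIZE  (LEVEL_MAX - LEVEL_MIN + 1)

/* Fill the table for one macroblock mode:
 *   intra     : 2*level
 *   non-intra : 2*level + sign(level) */
static void build_format_lut(int16_t lut[LUT_SIZE], int intra)
{
    for (int level = LEVEL_MIN; level <= LEVEL_MAX; level++) {
        int s = (level > 0) - (level < 0);
        lut[level - LEVEL_MIN] = (int16_t)(intra ? 2 * level : 2 * level + s);
    }
}

/* Look up one coefficient. When dc_marker is set the value passes through
 * unchanged, mirroring the pass-through condition for the DC coefficient. */
static int format_lookup(const int16_t lut[LUT_SIZE], int level, int dc_marker)
{
    assert(level >= LEVEL_MIN && level <= LEVEL_MAX);
    if (dc_marker)
        return level;
    return lut[level - LEVEL_MIN];
}
```

In this sketch the mode bit selects which table is built, whereas a PROM would more likely hold both tables and use the mode bit as an address line.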
The function (quantizer_scale*(non_)intra_quant[m][n]) is computed in the C30, then written to the Scaled Q Matrix Buffer, a high-speed dual-port RAM. This calculation is discussed in more detail later. Output of this buffer is controlled by the same address generator that drives the Coefficient Reordering RAM, where the inverse zig-zag function is performed. The output of the buffer is directed to a pair of hardware multipliers to implement (scaled matrix value)*(look-up table output). Two separate multipliers are used so that a most significant product is available in parallel with a least significant product without having to implement a double-speed demultiplexer circuit.
The remaining computations of the DQ algorithm are calculated in a Xilinx 4010 FPGA, one per video channel. These functions are divide-by-16 with rounding towards zero, even-odd rounding, and clipping. At this point, the fully dequantized DCT coefficients are sent to an SGS-Thomson IMS A121 DCT/IDCT chip for transformation back to the pixel domain. Figure 3 shows a block diagram of the DQ Board.
C30 Functionality
During power-up, the C30 resets after the Xilinx FPGA chips have auto-programmed. After correctly initializing itself and any other chips, the C30 writes a Reset Acknowledge to an external processor register, for transmission to the VLD Board. This is done to prevent the VLD Board from sending erroneous coefficient information to the FIFOs before they are ready, possibly causing a loss of synchronization.
During operation, the processor continually polls an external status register to determine where it is in the video pipeline (Figure 4). The signal "mbstart" is a single-clock cycle pulse from the VLD Board indicating a macroblock transfer. This signal is an input to the Xilinx timing generator, which in turn outputs the seven pipeline timing signals "mbsync" through "idct". The first five signals go to the external status register, while "idctgo" is the start signal to the IDCT chip, and "idct" is the output data framing pulse to the Motion Compensation Board. As can be seen from the pipeline diagram, the total board delay is approximately 3.8 macroblock time intervals (about 1200 clock cycles).
C30 computations are split up into two categories. These are DC differential calculations for intra macroblocks only, and quantization matrix updates for intra and non-intra macroblocks. Both are real-time operations, and must be completed within certain time slots in the video pipeline.
The Coefficient Reordering Dual-Port RAMs work by implementing a four macroblock deep circular buffer. Output from the buffer is offset by two macroblocks for pipeline timing reasons. As can be seen from Figure 3, access is provided between the RAMs and the C30, and it is through this path that DC delta values are read. The deltas are read by the C30 using a modulo-64 circular address read, modified to the actual DC value, then written back to the same location. This operation takes place during the first 128 clock cycles of the macroblock transfer (Figure 2). After the DC coefficients have been updated, the inverse zig-zag video stream is read out of the RAMs during the last 192 clock cycles of the macroblock time.
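The read-modify-write of the DC values can be sketched in C as follows. The memory layout is an assumption: three 64-coefficient blocks per macroblock per channel, stored contiguously with the DC term at offset 0 of each block, in a buffer four macroblocks deep. The names `next_dc_addr`, `read_slot`, and `update_dc` are hypothetical, and the selection of the luminance or chrominance predictor is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

#define COEFFS_PER_BLOCK   64
#define BLOCKS_PER_CHANNEL 3    /* assumed: three blocks per macroblock per channel */
#define MB_DEPTH           4    /* four-macroblock-deep circular buffer */
#define BUF_WORDS (COEFFS_PER_BLOCK * BLOCKS_PER_CHANNEL * MB_DEPTH)

/* Step from one DC location to the next, wrapping at the end of the buffer. */
static unsigned next_dc_addr(unsigned addr)
{
    return (addr + COEFFS_PER_BLOCK) % BUF_WORDS;
}

/* Macroblock slot being read out while write_slot is being filled:
 * output is offset from input by two macroblocks. */
static unsigned read_slot(unsigned write_slot)
{
    return (write_slot + MB_DEPTH - 2) % MB_DEPTH;
}

/* Replace the DC differential stored at addr with the actual DC value and
 * update the running predictor passed in by the caller. */
static void update_dc(int16_t buf[BUF_WORDS], unsigned addr, int *dc_past)
{
    assert(addr < BUF_WORDS && addr % COEFFS_PER_BLOCK == 0);
    *dc_past += buf[addr] * 8;          /* differential scaled by 8, as in the DQ algorithm */
    buf[addr] = (int16_t)*dc_past;      /* written back to the same location */
}
```

On the C30 itself the wrap-around would come from the circular addressing hardware rather than an explicit modulo operation.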
The second macroblock computation to be performed is that of the updated quantizer matrix. This updated matrix, to be used for every new macroblock, is calculated by multiplying the quantizer_scale by the (non_)intra_quant[m][n] values. To save transmitted bits, only the quantizer_scale is sent (if required), up to once per macroblock time. The (non_)intra_quant[m][n] default values are stored in C30 PROM and transferred to internal RAM at power-up initialization time. If a new non-default matrix is to be used, it is transmitted from the VLD Board into the Q Matrix FIFO. The C30 senses the presence of a new matrix by testing the FIFO empty flag, located in the same external processor status register as the pipeline timing signals. This new matrix is then transferred from the FIFO into internal RAM.
During the latter 192 clock cycles of the macroblock period, the C30 multiplies the quantizer_scale by the (non_)intra_quant[m][n] values, and outputs this new matrix to the Scaled Q Matrix Buffer, another high-speed dual-port RAM located on the expansion bus. The code fragment used to do this is as follows:
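MPYI3 *AR1++(1),R1,R2                      ; first product: R2 = (matrix entry at AR1) * R1
RPTS 62                                    ; repeat the next instruction 63 times
MPYI3 *AR1++(1),R1,R2 || STI R2,*AR3++(1)  ; store the previous product while computing the next
STI R2,*AR3                                ; store the 64th product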
This function was critical for real-time operation, and was simulated and verified to run in 135 cycles with the bus structure used in the design. There is also some overhead, including the fact that the C30 must keep track of where AR3 points, because the output buffer is a ping-pong arrangement. In this way, one 64-element section is streamed out to the external multipliers three times in succession, in step with the three blocks of each video channel.
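For readers less familiar with C30 assembly, the same operation can be sketched in C. This is an illustrative model only: the `scaled_q_buffer` structure and `scale_matrix` function are hypothetical names, and the two-half layout is an assumption standing in for the ping-pong arrangement of the Scaled Q Matrix Buffer.

```c
#include <assert.h>
#include <stdint.h>

#define MATRIX_SIZE 64

/* Hypothetical model of the ping-pong output buffer. */
typedef struct {
    int32_t half[2][MATRIX_SIZE];  /* two 64-entry halves */
    int     fill;                  /* half the processor writes next (0 or 1) */
} scaled_q_buffer;

/* Multiply every matrix entry by quantizer_scale, write the result into the
 * half not currently being streamed out, then swap halves. Returns a pointer
 * to the freshly written half. */
static const int32_t *scale_matrix(scaled_q_buffer *b,
                                   const int32_t quant[MATRIX_SIZE],
                                   int32_t quantizer_scale)
{
    assert(b->fill == 0 || b->fill == 1);
    int32_t *out = b->half[b->fill];
    for (int i = 0; i < MATRIX_SIZE; i++)
        out[i] = quantizer_scale * quant[i];
    b->fill ^= 1;                  /* next update goes to the other half */
    return out;
}
```

The assembly version overlaps each store with the following multiply, which is why it needs one multiply before the repeat and one store after it.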
C30 Real-Time Capabilities
As can be seen from the above, the C30 provides useful control and computing capabilities down the decoding hierarchy to the macroblock level, while maintaining real-time speed. It is instructive also to show what it is not capable of doing. By far the most compute-intensive function of the DQ Board is the IDCT function, which must be computed in real-time. A complete description of the IDCT is available from many sources, but an interesting comparison can be made from the descriptions in [2] and [3]. The particular implementation given in [2] lists an IDCT time for the 33 MHz C30 as 107.9 microseconds. Directly extrapolating to 40 MHz gives a time of 89 microseconds. To run in real-time with the given system timing parameters, the IDCT must be performed in (8*8)*53 nanoseconds = 3.4 microseconds, so a single C30 is too slow by a factor of approximately 25. The particular IDCT chip used, the SGS-Thomson IMS A121, maintains this real-time performance at the expense of a 128-clock cycle delay from input to output. Large pipeline delays such as this are not generally a concern in real-time video processing applications, and this one is not a concern here either.
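The arithmetic behind this comparison can be reproduced with a few lines of C. The helper names are hypothetical, and the clock extrapolation assumes the cycle count of the routine in [2] is unchanged at the higher clock rate.

```c
#include <assert.h>

/* Scale a measured execution time to another clock rate,
 * assuming the same number of cycles. */
static double scale_time_us(double t_us, double f_from_mhz, double f_to_mhz)
{
    assert(f_to_mhz > 0.0);
    return t_us * f_from_mhz / f_to_mhz;
}

/* Time available for one block, given the time per coefficient. */
static double block_budget_us(int coeffs, double ns_per_coeff)
{
    return coeffs * ns_per_coeff / 1000.0;
}

/* How many times slower the processor is than the real-time budget. */
static double shortfall(double have_us, double need_us)
{
    assert(need_us > 0.0);
    return have_us / need_us;
}
```

With the figures quoted above, `scale_time_us(107.9, 33.0, 40.0)` gives about 89.0 microseconds, `block_budget_us(64, 53.0)` gives about 3.39 microseconds, and their ratio is roughly 26, in line with the factor quoted in the text.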
C30 Error Detection
The C30 is able to monitor data transferred from the VLD to the DQ Board by checking data as well as FIFO flags. The two FIFOs that are directly accessed by the C30 are for transferring Q matrix data and macroblock type/sync signals. All of these transfers can be monitored and checked for validity because of their relatively low bandwidth. The two data coefficient FIFOs are not directly accessible to the C30, because the data rate is too high at this point for processor interaction. All four FIFOs do have empty, almost empty, almost full, and full flags routed to the external status register. In this way, the C30 can check for synchronization errors at different times in the video pipeline.
Board-Level Diagnostic Capability
A goal of the board design was that it be relatively straightforward to test and debug, from both hardware and firmware perspectives. By adding only a 12-pin header to the board, the XDS510 emulator was used to provide a serial scan path through the on-board C30, allowing full-speed emulation [4]. This connection allowed a relatively quick diagnostic check at initial power-up, as the C30 has access either directly or indirectly to most of the major functional areas of the board.
VLD Board Description
Overall View
Each of the four DQ Boards receives quantized DCT coefficients and control information from one of four Variable Length Decoder Boards. These boards allow programmable bitstream parsing, and like the DQ Boards, utilize a variety of technologies to achieve real-time performance. As of the time that this paper was written, the VLD Board was still in the design and simulation stages, so a detailed description such as that for the DQ Board is not given here.
The board is being designed around eight different functional areas. These areas are the VLD FIFO, ASIC Group, Sequencer, Coefficient Processor, Control Processor, Clock Generator, Master DSP, and Control DSP. Except for the ASIC Group and the two C30 chips, the other functions are to be implemented using FPGA technology. The ASIC is currently being designed with LSI Logic’s LCA200K family, mainly because of the high speed and large I/O pin count requirements.
As with the DQ Board, available processor bandwidth allows video processing down to the macroblock level. Coefficient-level processing must still be done in external circuitry such as the ASICs or the various FPGAs. Because the number of macroblock-level (and higher) functions is higher for the VLD Board, it was decided to split this functionality into two C30 chips, the Master DSP and the Control DSP.
C30 Functionality
The Master DSP is in a supervisory mode during board power-up and initialization. It must configure various operating parameters in registers and memories for all of the other seven major sections of hardware. After the board is correctly initialized, the first operating task is to monitor the input FIFO, to allow a TBD amount of buffer fullness to occur before release. A potential problem arises in a multiple video stream system such as this one if a section is significantly different in FIFO fullness than the others. For this reason, one of the four VLD Boards is configured as a manager. The C30 serial ports (Master DSP only) are wired together in a four-board ring arrangement, with the manager board releasing all four boards at the most appropriate overall time for the system. This serial port configuration is also used during normal operation as a frame sync test, as well as for general error reporting.
During actual bitstream decoding, the main function of the Master DSP is to keep the flow of information going between the Sequencer and the output FIFOs. The Sequencer is a microcoded state machine that directly controls the ASIC Group (for bitstream parsing). The control interface from the Master DSP to the Sequencer is via a polled register file, while the data flow in the other direction is via a high-speed FIFO.
The VLD Board must supply not only data coefficients, but also control type information. This information is sent to both the DQ Board and the Motion Compensation Board, via the Control Processor FPGA. The current data transfer from the C30 to the Control Processor is via 55 32-bit control words. Because of the C30’s pipelined bus writes, the number of processor cycles to do this output operation alone is 3+(54*2) = 111 clock cycles. As was shown before, a macroblock time in this system is 320 clock cycles long. To prevent this from consuming a substantial amount of the Master DSP’s time, and to provide enough headroom for possible syntax changes in the future, it was decided to add a second C30, the Control DSP. Communication between the two C30s is by way of a high-speed 4Kx32 dual-port RAM, one port to each processor main bus. As in the DQ Board, the dual-bus architecture of the C30 is used to advantage, with the control data being output from the expansion bus.
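The cycle count above generalizes to a simple formula, sketched here in C. The model of three cycles for the first write and two for each subsequent write is inferred from the 3+(54*2) expression in the text; the function names are hypothetical.

```c
#include <assert.h>

#define MACROBLOCK_CYCLES 320   /* macroblock time in clock cycles */

/* Cycles to write a burst of control words, assuming 3 cycles for the
 * first write and 2 for each write after it. */
static int control_write_cycles(int words)
{
    assert(words >= 1);
    return 3 + (words - 1) * 2;
}

/* Fraction of one macroblock time consumed by a given cycle count. */
static double macroblock_fraction(int cycles)
{
    return (double)cycles / MACROBLOCK_CYCLES;
}
```

For the 55 control words this gives 111 cycles, or a little over a third of the 320-cycle macroblock time, which illustrates why the output task was moved to a second processor.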
Conclusion
A programmable HDTV decoder system designed and built for R&D purposes has been described that utilizes the TMS320C30 DSP chip as the heart of two of the major subsystem boards. The capabilities of the C30, such as fast parallel multiplies, high-speed I/O, full 32-bit integer instruction capability, circular address handling, and a full dual external bus structure are used to advantage to allow processing down to the macroblock level of the video hierarchy, while still maintaining substantial flexibility for future enhancements.
References
[1] Committee Draft of Standard ISO11172: "Coding of Moving Pictures and Associated Audio", ISO/MPEG, December 1991.
[2] W. Hohl, "8x8 Discrete Cosine Transform Implementation on the TMS320C25 or the TMS320C30", in Digital Signal Processing Applications with the TMS320 Family – Theory, Algorithms, and Implementations – Volume 3, P. Papamichalis, ed., Texas Instruments, 1990, pp. 169-181.
[3] SGS-Thomson Microelectronics, Image Processing Databook, 1990.
[4] Texas Instruments, TMS320C30 User’s Guide, 1992.