HDTV Motion Vector Decoding With a TMS320C30
By John Wiseman
(Originally presented at the International Conference on Signal Processing Applications and Technology - Dallas, TX - October 1994)
Abstract
This paper examines the issues associated with the implementation of an HDTV motion vector processing stage using a Texas Instruments TMS320C30 DSP chip as the main processing element. This device is part of a Variable Length Decoder board that is in turn a portion of an HDTV video decoder prototype. This system was designed and developed by Panasonic Advanced Television & Video Laboratories (ATVL) of Burlington, N.J. as part of an ongoing HDTV R&D program.
System Description
The ATVL HDTV decoder system has a video clock of 74.25 MHz. To allow a reasonable hardware implementation, the video is divided into 4 equally sized vertical strips. In this manner, 4 mostly-independent hardware sections can process 4 video data streams in parallel at approximately 18.6 MHz. It should be noted that with this timing, 1 section alone could support decoding of digital NTSC video signals. Each section is fed a demultiplexed video stream by a Deformat/Router board, which is in turn fed by a Bitstream Computer Interface board. The interface board, combined with a PC on its input, simulates a digital receiver output and sends a raw digital bitstream to the decoder at rates up to 18 Mb/sec. Decompressed video from the 4 sections is recombined into 1 composite signal by the Section-to-Raster Converter board, then sent to the D/A board before being displayed on an HDTV monitor.
Each of the 4 video decoder sections consists of a Variable Length Decoder board, a Dequantization board and a Motion Compensation board. The Dequantization board is based on a C30 DSP chip, and the design is described in [1] and [2]. The Variable Length Decoder board contains 2 C30 chips, 1 "master" processor for syntax decoding and system control purposes, and 1 "communication" processor, responsible for formatting and transmitting control and motion vector information to external boards. As there are 4 video sections, the system contains a total of 12 C30 DSP chips.
When this project was started, a standard for video compression/decompression had not been chosen by the FCC for HDTV broadcasting in the U.S. As such, an expanded MPEG I syntax was developed by ATVL allowing extensions such as frame/field modes, multiple video sections, etc. This hybrid syntax is followed in the design of the system described here.
C30-Based Design Advantages
Use of the C30 in a non-traditional application such as this was accomplished successfully by ATVL in the design of the Dequantization board. This experience was utilized in the Variable Length Decoder board design to fulfill project goals of programmability with minimal amounts of hardware. Some of the advantages of using the C30 DSP chip in this application are:
· Relative ease of hardware design.
· Dual independent bus structure.
· Dual internal RAM banks.
· 2- and 3-operand instructions.
· Parallel instructions.
· Indirect addressing with indexing.
· DMA I/O.
Hardware Overview
The hardware that runs the motion vector decoding routines described in this paper is fairly simple. It consists of a Texas Instruments TMS320C30 DSP chip whose main bus is connected to a 4Kx32 dual-port RAM, with an additional 8Kx32 SRAM for future expansion (currently not used). The expansion bus is connected to a 4Kx32 FIFO that the C30 uses to read data from the incoming bitstream, as well as a 4Kx18 FIFO that is used as an output interface to the Dequantization and Motion Compensation boards.
The total size of the program, including motion vector look-up tables, is 3.8K words, written entirely in assembly language for speed. After the processor boot operation, the look-up tables (1K) are transferred from the dual-port memory to internal RAM bank 0. The program uses internal RAM bank 1 for buffer and variable storage. In this manner, full speed is maintained without instruction access conflicts.
The board clock is one-half of the system clock, or 37.125 MHz. Because the motion vector processor is isolated from the rest of the system with FIFOs, it is allowed to run asynchronously and has a separate 40 MHz processor clock for improved real-time performance, especially at the macroblock level.
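The clock figures quoted in this paper follow directly from the 74.25 MHz video clock. The short Python check below is illustrative arithmetic only; the variable names are hypothetical and are not part of the ATVL design.

```python
# Illustrative arithmetic only; all names here are hypothetical.
VIDEO_CLOCK_MHZ = 74.25   # system video clock (System Description)
NUM_SECTIONS = 4          # vertical strips processed in parallel

# Each section handles one quarter of the video data.
section_rate_mhz = VIDEO_CLOCK_MHZ / NUM_SECTIONS

# The board clock is one-half of the system clock.
board_clock_mhz = VIDEO_CLOCK_MHZ / 2

print(f"section rate: {section_rate_mhz} MHz")
print(f"board clock:  {board_clock_mhz} MHz")
```

The per-section figure of 18.5625 MHz is the value rounded to "approximately 18.6 MHz" in the System Description.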
At system power-up, the "master" C30 (part of the syntax processor) downloads the program into the dual-port RAM. Part of this dual-port RAM is also reserved for use as a processor communication "mailbox" where syntax command information is received and read from the syntax processor. In this architecture, all layer commands (process slice, process macroblock, etc.) are sent via the mailbox.
Upper-Level Processing
The following upper-level (above macroblock) functions are serviced by the program, with a brief description of the events performed:
· Sequence: No processing at this layer.
· Group of Pictures: No processing at this layer.
· I Frame: Output I Frame code to DQ/MC boards.
· P Frame: Input fwd_full_pixel_vector_flag, fwd_f_code_h, fwd_f_code_v. Output P Frame code to DQ/MC boards.
· B Frame: Input fwd_full_pixel_vector_flag, fwd_f_code_h, fwd_f_code_v, bwd_full_pixel_vector_flag, bwd_f_code_h, bwd_f_code_v. Output B Frame code to DQ/MC boards.
· I Slice: Input Q_Scale.
· P Slice: Input Q_Scale. Clear motion vector differential buffers.
· B Slice: Input Q_Scale. Clear motion vector differential buffers.
Macroblock Processing
The following macroblock types are processed by the algorithm:
· P_skipped_MB
· B_skipped_MB
· intra_xxDCT
· fwd_fr_cpb
· fwd_xxMo_xxDCT
· bwd_xxMo_xxDCT
· bid_xxMo_xxDCT
· fwd_xx_skipped
· bwd_xx_skipped
· bid_xx_skipped
In the above macroblock descriptions, fwd is forward, bwd is backward, bid is bi-directional, cpb is coded block pattern, Mo is motion compensation format, and xx is either fr (frame) or fi (field). As can be seen by the xx labels, enhancements have been made to the basic MPEG I syntax to allow for interlaced video processing with the field-mode extensions for motion compensation and DCT processing.
Macroblock Processing Example
There are too many different macroblock-processing routines to describe in this paper. However, it is worthwhile to look in detail at the routine with the most critical real-time performance, bid_fiMo_xxDCT (bi-directional, field-mode motion compensation).
In the ATVL HDTV system, field-mode motion compensation results in the bitstream containing 16 motion vectors per macroblock. The motion vectors are organized in groups of 4 for each 8x8 luminance block, such that each block contains the elements for forward horizontal, forward vertical, backward horizontal and backward vertical motion. Also, each block has 2 associated motion reference bits (1 each for forward and backward motion) that indicate which field the motion is referenced to. This is considerably more information to be input and processed in a macroblock time increment than in frame motion compensation mode which only inputs 1 group of 4 motion vector elements.
There are 6 distinct sections to the bid_fiMo_xxDCT macroblock processing routine. These are motion vector and motion reference input, DQ/MC board control header generation, across-block motion vector differential calculation and reordering, horizontal motion vector formatting, vertical motion vector formatting, and output via DMA. These are discussed individually as follows.
The raw bitstream vector information is preprocessed by the syntax processing sequencer (external hardware) look-up table to generate vectors in MPEG format. These MPEG vectors (and associated motion references) are received in the following order:
FHD0 forward horizontal, block D0
FVD0 forward vertical, block D0
FVMRD0 forward vertical motion reference, block D0
FHD1 forward horizontal, block D1
FVD1 forward vertical, block D1
FVMRD1 forward vertical motion reference, block D1
FHD2 forward horizontal, block D2
FVD2 forward vertical, block D2
FVMRD2 forward vertical motion reference, block D2
FHD3 forward horizontal, block D3
FVD3 forward vertical, block D3
FVMRD3 forward vertical motion reference, block D3
BHD0 backward horizontal, block D0
BVD0 backward vertical, block D0
BVMRD0 backward vertical motion reference, block D0
BHD1 backward horizontal, block D1
BVD1 backward vertical, block D1
BVMRD1 backward vertical motion reference, block D1
BHD2 backward horizontal, block D2
BVD2 backward vertical, block D2
BVMRD2 backward vertical motion reference, block D2
BHD3 backward horizontal, block D3
BVD3 backward vertical, block D3
BVMRD3 backward vertical motion reference, block D3
This ordering is an extension of the way MPEG I defines ordering for frame motion compensation. The ATVL HDTV motion compensation board requires a reordering of these vectors so that the blocks are output in the order D0, D2, D1 and D3. As such, the desired ordering in the internal memory buffer for the motion reference signals is the following:
FVMRD0
BVMRD0
FVMRD2
BVMRD2
FVMRD1
BVMRD1
FVMRD3
BVMRD3
This reordering is accomplished by judicious use of parallel assembly language instructions as detailed below:
AND3 *AR2,R0,R2 || STI R2,*AR4++(IR1)
AND3 *AR2,R0,R2 || STI R2,*++AR6
ASH3 R1,*AR2,R2 || STI R2,*+AR6(IR0)
Etc
Where AR2 points to the input FIFO on the expansion bus, AR4 points to the motion reference buffer (internal RAM) and AR6 points to the motion vector buffer (internal RAM). AND instructions are used for input for masking purposes and ASH instructions are used to format the motion reference signals in a manner to be explained later. By carefully changing the index to AR4, the motion reference signals can be stored in the desired order shown above in 1 input step with no pipeline breaks. The ordering of the motion vectors after this step is close to the desired order, as follows:
FHD0
FHD1
FHD2
FHD3
FVD0
FVD1
FVD2
FVD3
BHD0
BHD1
BHD2
BHD3
BVD0
BVD1
BVD2
BVD3
Final reordering into ATVL format is then accomplished by again using indexed addressing during the differential calculation.
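The effect of the input stage on the two internal buffers can be modelled in a few lines of Python. This is a behavioural sketch only, not the ATVL assembly code: the function and variable names are hypothetical, and the indexed C30 stores are replaced by explicit index arithmetic. It assumes the 24 words arrive exactly in the order listed earlier.

```python
# Behavioural sketch of the input-stage reordering; all names are hypothetical.
ATVL_BLOCK_ORDER = [0, 2, 1, 3]   # block order required by the MC board


def input_stage(words):
    """Sort 24 bitstream words into a vector buffer and a reference buffer.

    `words` arrive as (horizontal, vertical, motion reference) triples for
    blocks D0..D3, forward direction first and then backward.
    """
    mv_buf = [None] * 16   # FH D0-D3, FV D0-D3, BH D0-D3, BV D0-D3
    mr_buf = [None] * 8    # forward/backward pairs in ATVL block order
    stream = iter(words)
    for direction in range(2):            # 0 = forward, 1 = backward
        for block in range(4):
            horiz, vert, ref = next(stream), next(stream), next(stream)
            mv_buf[8 * direction + block] = horiz
            mv_buf[8 * direction + 4 + block] = vert
            slot = ATVL_BLOCK_ORDER.index(block)
            mr_buf[2 * slot + direction] = ref
    return mv_buf, mr_buf


def bitstream_labels():
    """Build the label sequence in received order, for demonstration."""
    return [f"{d}{kind}D{b}" for d in "FB" for b in range(4)
            for kind in ("H", "V", "VMR")]


mv_buf, mr_buf = input_stage(bitstream_labels())
print(mr_buf)
print(mv_buf)
```

Feeding the labels through the model reproduces both orderings given above: the motion references come out already interleaved in D0, D2, D1, D3 order, while the vectors are grouped by element and still need the final reordering.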
The next stage in the algorithm places the correct 7-word header into the output buffer. This may seem like a rather arbitrary place for this to occur, but it was chosen to minimize pipeline conflicts with other stages.
Now that the MPEG format motion vectors are in the input buffer, an across-block differential calculation must be performed for each of the 4 elements: forward horizontal, forward vertical, backward horizontal, and backward vertical. Again, to minimize cycles, 2 operations are performed in this step that are residual from the input stage. First, before the differential value is added to the previous vector, the value is constrained in range via the appropriate f_code, which is input and formatted at the frame layer. This f_code yields values held in internal memory buffers that, when used as shown below, result in the properly scaled vector differential.
ASH @FCDH_A,R3
ASH @FCDHN_A,R3
Where address FCDH_A contains a left arithmetic shift value pre-computed at the frame layer, and FCDHN_A contains the negative of that value, resulting in a right arithmetic shift. These instructions also perform the necessary sign extension of data words that are narrower than 32 bits. Again, parallel instructions are used whenever possible, and the index register is again exploited within the ADDI3 || STI instructions to store the updated motion vectors in the ATVL block order discussed previously.
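The shift pair can be modelled as follows. Shifting a 32-bit word left by n bits and then arithmetically right by n bits keeps the low 32 - n bits and sign-extends them. The Python sketch below makes several assumptions that are not stated in this paper: the mapping from f_code to shift count is left as a plain parameter (`shift`), the predictor is a simple running sum across blocks D0 to D3 starting from `predictor`, and all function names are hypothetical.

```python
# Behavioural sketch; `shift`, `predictor` and all names are assumptions.
ATVL_BLOCK_ORDER = [0, 2, 1, 3]


def to_signed32(x):
    """Interpret the low 32 bits of x as a two's-complement value."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def ash(value, count):
    """Arithmetic shift of a 32-bit word: left if count >= 0, else right."""
    if count >= 0:
        return to_signed32(value << count)
    return to_signed32(value) >> -count


def constrain(raw, shift):
    """Left shift then right shift: range-limit and sign-extend `raw`."""
    return ash(ash(raw, shift), -shift)


def accumulate(diffs, shift, predictor=0):
    """Add constrained differentials for D0..D3, store in ATVL block order."""
    out = [None] * 4
    for block, raw in enumerate(diffs):
        predictor += constrain(raw, shift)
        out[ATVL_BLOCK_ORDER.index(block)] = predictor
    return out


# With shift = 26, the low 6 bits are kept, so 0x3F is read as -1.
print(constrain(0x3F, 26))
print(accumulate([1, 0x3F, 2, 0x3E], 26))
```

In the actual routine the two shifts are the ASH instructions shown above and the reordered store is folded into the ADDI3 || STI instructions; the model only shows the arithmetic.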
The next 2 stages are for horizontal and vertical motion vector post-processing. These functions are split up because of the different nature of the processing tasks, and the need to maintain an efficient pipeline for execution speed. In the case of the horizontal motion vectors, the luminance vectors are already in the correct format for the motion compensation board to process, and are left in the previously defined memory buffer. Chrominance vectors are calculated by running the luminance vectors through a look-up table located in internal RAM, then again using parallel instructions with address indexing to place the resulting vectors into the buffer in the correct order. Also in this stage is an offset calculation resulting from the fact that the ATVL macroblock origin is referenced from the upper left-hand corner (unlike MPEG I where it is referenced from the center point). This is simply an addition of 8 pixel units (16 half-pixel units for half-pixel mode) for the appropriate luminance block vectors and an addition of 4 pixel units (8 half-pixel units for half-pixel mode) for the appropriate chrominance block vectors.
Vertical motion vectors in field motion compensation mode are calculated in the opposite manner from that of the horizontal vectors. In this stage, the chrominance values are copied from the existing luminance buffer values, while the new luminance vectors are processed with a look-up table located in internal RAM. The vectors in this look-up table are formatted so that there is a hole at the 2nd bit location (the LSB stays, and the rest of the word is shifted left 1 bit). After the vectors are received from the look-up table, the appropriate previously added and stored motion reference signal is added to the hole location. In this manner, the resulting vertical luminance motion vector to be sent to the motion compensation board contains the half-pixel flag at bit 0 (LSB), the motion reference signal indicating the appropriate reference field at bit 1, and the motion vector itself starting at bit 2. As used extensively in the previously described stages, parallel instructions with indexed stores are utilized to complete the data in the output buffer for transfer via DMA to the external boards.
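The resulting word layout can be illustrated with a short Python sketch. It computes the "hole" format directly instead of reading it from a look-up table, and it assumes the table does nothing beyond opening the hole at bit 1; any further scaling done by the real table is not modelled. The function names are hypothetical.

```python
# Sketch of the vertical luminance word layout; names are hypothetical.
def make_hole(vector):
    """Keep the half-pixel flag at bit 0 and open a hole at bit 1."""
    return ((vector >> 1) << 2) | (vector & 1)


def pack_vertical_luma(vector, motion_ref):
    """Add the motion reference bit (0 or 1) into the hole at bit 1."""
    return make_hole(vector) + (motion_ref << 1)


def unpack_vertical_luma(word):
    """Recover (vector, motion_ref) from a packed word."""
    half_pixel = word & 1
    motion_ref = (word >> 1) & 1
    vector = ((word >> 2) << 1) | half_pixel
    return vector, motion_ref


packed = pack_vertical_luma(5, 1)   # vector 0b101, reference field 1
print(bin(packed))                  # 0b1011
print(unpack_vertical_luma(packed))
```

Because the hole is always zero before the motion reference is added, the addition can never carry into the vector bits, which is why a plain add is sufficient.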
The final stage of this routine is a DMA output for the 55 data words (7 header control words, 16 luminance motion vectors, and 32 chrominance vectors) generated every macroblock time increment by the above (or any of several other) macroblock processing routines. Since the example routine given is extremely long relative to the macroblock time increment (17.24 microseconds), it is possible that a subsequent routine of very short duration (intra for example) might be overwriting the buffer while the DMA is still trying to read it. Because of this, a simple ping-pong arrangement was implemented for the output buffer.
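A ping-pong output buffer of this kind can be sketched as below. This is a minimal Python model under stated assumptions, not the ATVL implementation: the class and method names are hypothetical, and DMA completion is not modelled, so a real design must still guarantee that the DMA has finished with a buffer before it is filled again two macroblocks later.

```python
# Minimal ping-pong buffer model; class and method names are hypothetical.
WORDS_PER_MACROBLOCK = 55   # 7 header + 16 luminance + 32 chrominance words


class PingPongBuffer:
    def __init__(self, size=WORDS_PER_MACROBLOCK):
        self._buffers = [[0] * size, [0] * size]
        self._fill = 0   # index of the buffer the routine writes into

    @property
    def fill_buffer(self):
        """Buffer currently owned by the macroblock routine."""
        return self._buffers[self._fill]

    def swap(self):
        """Hand the filled buffer to the DMA and switch to the other one."""
        ready = self._buffers[self._fill]
        self._fill ^= 1
        return ready


pp = PingPongBuffer()
pp.fill_buffer[0] = 111      # long routine fills buffer 0
dma_source = pp.swap()       # DMA starts reading buffer 0
pp.fill_buffer[0] = 222      # short routine writes buffer 1 instead
print(dma_source[0], pp.fill_buffer[0])
```

The short routine's write lands in the second buffer, so the data still being read by the DMA is left untouched.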
Instruction Utilization
A PERL program was written to allow a histogram analysis of each routine, showing the C30 assembly instructions used along with the frequency of occurrence. The histogram for the bid_fiMo_xxDCT routine is shown below:
Lines Containing Instructions = 190
Instruction Count Percent
ADDI 16 8.4
ADDI3 16 8.4
ADDI3 || STI 23 12.1
AND3 1 0.5
AND3 || STI 15 7.9
ASH 32 16.8
ASH3 1 0.5
ASH3 || STI 15 7.9
BRD 1 0.5
LDI 25 13.2
LDI || LDI 2 1.1
LDI || STI 14 7.4
MPYI 1 0.5
OR3 1 0.5
STI 21 11.1
STI || LDI 1 0.5
STI || STI 4 2.1
XOR 1 0.5
Count Total = 190
As can be seen from the histogram, 74 out of 190 instructions (39%) are parallel instructions resulting in greater throughput. This can be compared to the histogram for the bid_frMo_xxDCT routine (frame motion compensation mode) as shown below:
Lines Containing Instructions = 124
Instruction Count Percent
ADDI 5 3.0
ADDI3 8 4.8
ADDI3 || STI 21 12.7
AND3 5 3.0
AND3 || STI 45 27.1
ASH 20 12.0
BNZ 4 2.4
BRD 1 0.6
CMPI 4 2.4
LDI 25 15.1
LDI || STI 8 4.8
MPYI 1 0.6
OR 1 0.6
OR3 1 0.6
OR3 || STI 1 0.6
RPTB 1 0.6
STI 7 4.2
STI || STI 8 4.8
Count Total = 166
Note that the number of Lines Containing Instructions may or may not equal the Count Total. This is because the histogram generating program counts what is actually executed after RPTS and RPTB (repeat single and repeat block) instructions if any are used in the routine. Because of the more straightforward nature of the motion vector algorithm in frame mode, parallel instructions could be used in 83 out of 166 instructions, a 50% utilization.
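A histogram of this kind is straightforward to produce. The Python sketch below is not the original PERL program; it is a simplified stand-in with hypothetical function names that counts one entry per source line, assumes that lines carry no labels and that comments start with ";", and does not expand RPTS or RPTB loops as the original program does.

```python
# Simplified instruction histogram; not the original PERL program.
from collections import Counter


def mnemonic_key(line):
    """Return e.g. 'AND3 || STI' for a source line, or None if it is empty."""
    code = line.split(";", 1)[0].strip()      # drop comments
    if not code:
        return None
    parts = [part.split()[0].upper()
             for part in code.split("||") if part.strip()]
    return " || ".join(parts)


def histogram(lines):
    """Map each instruction form to (count, percent of total)."""
    counts = Counter(key for key in map(mnemonic_key, lines) if key)
    total = sum(counts.values())
    return {key: (n, round(100.0 * n / total, 1))
            for key, n in sorted(counts.items())}


def parallel_share(hist):
    """Return (number of parallel instructions, total instructions)."""
    total = sum(n for n, _ in hist.values())
    parallel = sum(n for key, (n, _) in hist.items() if "||" in key)
    return parallel, total


listing = [
    "AND3 *AR2,R0,R2 || STI R2,*AR4++(IR1)",
    "AND3 *AR2,R0,R2 || STI R2,*++AR6",
    "ASH3 R1,*AR2,R2 || STI R2,*+AR6(IR0)",
    "ASH @FCDH_A,R3 ; scale by f_code",
]
hist = histogram(listing)
for key, (count, percent) in hist.items():
    print(key, count, percent)
print(parallel_share(hist))
```

Applied to a complete routine, `parallel_share` gives the kind of figure quoted in this section, such as 74 of 190 or 83 of 166.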
Potential Speed Improvements
The use of parallel instructions as discussed above certainly helps to speed up the execution of the motion vector computations, if certain potential pitfalls are avoided. As was shown in the instruction histograms, the instructions ADDI3 || STI, AND3 || STI and ASH3 || STI are used quite frequently. These instructions all follow the generalized parallel instruction syntax as detailed for a parallel ADD and STI [3]:
ADDI3 src2,src1,dst1 || STI src3,dst2
Where:
· src1 register (R0-R7)
· src2 indirect (disp = 0, 1, IR0, IR1)
· dst1 register (R0-R7)
· src3 register (R0-R7)
· dst2 indirect (disp = 0, 1, IR0, IR1)
As can be seen, indirect addressing with displacement is somewhat limited in the parallel instruction format, with the index limited to the constants 0 and 1, or the contents of either IR0 or IR1. Note that there are not 8 index registers, as there are for the general registers R0-R7 and the auxiliary registers AR0-AR7. Because of this it is difficult to do non-standard types of reordering while using these instructions. Compounding the problem is the fact that the index registers (IR0 and IR1) cannot be modified immediately before either is to be used in one of the subsequent parallel instructions without resulting in a 2-cycle delay penalty. Because of this, reordering was spread between computation stages as described previously, while taking full advantage of the fact that the indirect addresses can be generated by using successive combinations of pre- and post-increment and pre- and post-decrement, with modification or without. The disadvantage to writing assembly language in this style is that it becomes extremely hard to read and modify later if necessary.
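The displacement restriction can be expressed as a small operand check. The Python sketch below is a hypothetical helper, not part of any TI tool. It covers only the plain indirect forms that appear in this paper (no circular or bit-reversed addressing), and it assumes, as in the C30 assembler syntax, that an omitted displacement on a modified form means 1 and that a bare *ARn means a displacement of 0.

```python
# Hypothetical operand check for parallel-instruction indirect addressing.
import re

# Matches forms such as *AR2, *++AR6, *+AR6(IR0) and *AR4++(IR1).
_INDIRECT = re.compile(r"^\*(\+\+|--|\+|-)?AR[0-7](\+\+|--)?(?:\((\w+)\))?$")
ALLOWED_DISPLACEMENTS = {"0", "1", "IR0", "IR1"}


def parallel_operand_ok(operand):
    """True if `operand` uses a displacement allowed in parallel instructions."""
    match = _INDIRECT.match(operand.replace(" ", "").upper())
    if match is None:
        return False
    pre, post, disp = match.groups()
    if pre and post:
        return False                 # cannot pre- and post-modify together
    if disp is None:
        disp = "1" if (pre or post) else "0"
    elif not (pre or post):
        return False                 # displacement needs a modifier
    return disp in ALLOWED_DISPLACEMENTS


for operand in ("*AR2", "*++AR6", "*+AR6(IR0)", "*AR4++(IR1)", "*+AR6(5)"):
    print(operand, parallel_operand_ok(operand))
```

A check like this only catches illegal displacements. It does not detect the 2-cycle penalty caused by modifying IR0 or IR1 just before use, which depends on instruction ordering.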
Pipeline delays also manifest themselves in the routines that utilize internal RAM look-up tables for the motion vector processing. A basic code fragment for this is shown as follows:
ADDI3 *AR7,R6,AR0
LDI *AR0,R1 || STI R1,*AR2++(IR0)
Where AR7 points to the original motion vector, R6 contains the address of the zero point of the look-up table, and AR0 contains the new motion vector address in the look-up table. This also results in a 2-cycle delay due to the nature of the C30 pipeline. This problem can be minimized by inserting other useful instructions that do not utilize addressing registers between the 2 instructions, if feasible. In the ATVL code, this was not always possible, and resulted in some loss of speed. The problem could be reduced further by using a C40, where it manifests itself only when the modified register is identical to the following instruction's address generation register. The above code may then be modified so that AR0 is effectively alternated with AR1, with the instructions repipelined to accommodate this.
Another advantage to using the C40 in this application would be the fact that it is currently available in a 50 MHz version, thus improving real-time performance by another 25% over the currently used 40 MHz C30.
Conclusion
A C30 DSP-based design has been implemented that processes HDTV motion vectors in a real-time system. This somewhat atypical use of a DSP chip demonstrates that the C30 can function in an application such as real-time video processing as a very high performance microprocessor if certain guidelines and restrictions are carefully followed, both in the high-level system design as well as in the low-level hardware/software implementation.
References
[1] J.R. Wiseman, "Use of the TI TMS320C30 in the Design of a Programmable HDTV Video Decoder", in Proc. ICSPAT, vol. 1, pp. 1-8, San Jose (October 1993).
[2] J.R. Wiseman, "A Programmable Video Dequantizer for HDTV", in Proc. DSPX, pp. 225-232, San Jose (October 1993).
[3] Texas Instruments, TMS320C30 User's Guide, 1992.