A Complete DSP Design Example Using FIR Filters
by John Wiseman
(Originally published in QEX magazine, July 1996)
Introduction
The purpose of this article is to illustrate a design flow example for a Digital Signal Processing (DSP) algorithm, from basic conception through real-time implementation. The algorithm that was chosen to be illustrated in this article is the Finite Impulse Response (FIR) filter. This class of digital filter is capable of excellent performance in amateur radio applications, is fairly easy to implement in a DSP chip, and its implementation utilizes a number of high performance features unique to DSP architectures. By going into low-level detail, insights are provided into important aspects of DSP design and implementation. These issues are pertinent not only to FIR filters, but to other more advanced algorithms as well, such as adaptive filtering, frequency domain processing, demodulation, and other functions that amateur radio experimenters will find of interest. Areas that will be discussed include simulation of algorithm performance before implementation, real-time performance estimation, floating point vs. fixed point DSP chip architectures, and how certain unique features of DSP chips can be effectively utilized.
FIR Filter Design
The implementation equation for a Finite Impulse Response (FIR) filter is as follows:
, (1)
where x(n-m) are sampled data inputs, y(n) is the filtered output, and c(m) are the filter coefficients, or tap values, that together represent the impulse response. It is this equation that we seek to first find appropriate values of c for, and then calculate in real-time to produce an output y. This equation may be compared to that of an Infinite Impulse Response (IIR) filter which is of the form:
.
(2)
Because of the feedback terms on the right, the IIR filter contains both poles and zeroes in the transfer function. An FIR filter has only zeroes in its transfer function, as there is no feedback. As such, the IIR filter is more susceptible to instabilities due to quantization of filter coefficients, an issue that we will discuss in further detail later. Another important issue is that linear phase can be guaranteed for an FIR filter by constraining the coefficients to be symmetric in the implementation. The proof of this is beyond the scope of this article, but is generally given in signal processing textbooks dealing with FIR filter theory1,2. Because of their inherent stability and linear phase characteristics, as well as the fact that they are relatively easy to derive coefficients for, FIR filters provide good first-time DSP design experience. The main drawback to using FIR filters is that in general they will require considerably more coefficients and computations to provide the equivalent amount of magnitude attenuation compared to an IIR filter.
One of the easiest and most straightforward ways of deriving the coefficients for an FIR filter is to use what is known as the Fourier series method. This method takes advantage of the fact that the frequency response of the filter is really a periodic function with the same period as that of the sampling frequency used to sample the analog input signal. See Figure 1A for clarification of this point. Because of this periodicity of the frequency response "waveform" we can derive the Fourier series coefficients, and then use them as our FIR filter coefficient values. The equation for deriving these coefficients is:
,
(3)
where n is the coefficient number and H(n) is the desired frequency response. In our case, the frequency response is desired to be of magnitude one in the passband and magnitude zero outside of the passband so this equation reduces to:
(4)
for an even function lowpass filter. Note that fc represents the desired filter cutoff frequency expressed as a fraction of the sampling frequency fs. Because of this, its range will be 0 to 0.5, as this represents the range of frequencies under the Nyquist limit (1.0 is the sampling rate itself). A note of caution is warranted here. Some literature utilizes the fractional notation as it is used in this article, where 0.5 is the Nyquist limit and 1.0 is the sampling frequency, while other sources have been known to define their Nyquist limit at 1.0 and the sampling frequency at 2.0. Other sources may define these values in angular frequency terms, as p and 2p respectively.
As an example, a quick design for a 7-tap FIR filter with fc = 0.125 (i.e. 900 Hz. if fs = 7200 Hz.) may be done with a hand calculator. Because of coefficient symmetry about zero, we only have to calculate the following:
c3 = 0.07503 c2 = 0.15915 c1 = 0.22508.
By using symmetry:
c-3 = 0.07503 c-2 = 0.15915 c-1 = 0.22508.
The "zero" coefficient cannot be directly calculated as it results in 0/0, but elementary calculus (L Hopitals rule) tells us that the result will be 2fc itself, so
c0 = 0.25000.
From this basic result for a lowpass filter, it is interesting to see how to obtain other filters as well. A highpass filter with the same cutoff frequency can be obtained from the lowpass coefficients by using the following transformation:
.
(5)
Using the lowpass coefficients from the previous example gives the following highpass coefficients:
c-3 = -0.07503 c-2 = 0.15915 c-1 = -0.22508 c0 = 0.25000 c1 = -0.22508 c2 = 0.15915
c3 = -0.07503.
I tend to use bandpass filters the most, and they are easily designed by changing the limits of integration on the formula for the Fourier series coefficients. An example with the lower passband edge at fcl and the upper passband edge at fcu (Figure 1B) is:
.
(6)
Computer Generated Coefficients
The formulas for generating the FIR filter coefficients are fairly simple and easy to use, but impractical to use directly for more than a few high precision values. This job can be made almost trivial by writing a software program that can calculate the coefficients for arbitrary cutoff frequencies and arbitrary numbers of filter coefficients. Appendix A contains a listing for a C program that I have written to perform this function. Included in the program is a function that also produces the frequency response for the newly designed filter for performance verification before implementation. These values can be inspected visually, or imported into a plot program for more detailed analysis.
As an example of this capability, Figure 2 shows the output for three different FIR filter designs. The filters were designed to have a lowpass cutoff frequency of 900 Hz., which corresponds to a factor of 0.125 if the sampling frequency is set at 7200 Hz. The filters were designed with 7, 31, and 127 coefficients, and one can see that the desired frequency response is more closely approximated by using more coefficients as would be expected from Fourier theory.
As was previously stated, the Fourier series method is easy and straightforward to implement, but it is somewhat limited in its capabilities. Other mathematical methods are often used to derive the coefficients such as maximally flat approximations, least-squared error designs, and minimax error designs. A powerful example of the latter technique is the Remez exchange algorithm, which can be used to design filters to an arbitrary set of specifications with the minimum resulting number of filter coefficients. Variations of these algorithms are provided on supplemental disks to various DSP textbooks3,4,5 for those experimenters who do not desire to write their own coefficient generating software.
Window Functions
The coefficients for a lowpass FIR filter of 127 taps were generated and are plotted in Figure 3. It can be seen that the coefficients do not smoothly approach zero, but instead are abruptly truncated on either end of the sequence. It is this abruptness that causes the ringing in the passband known as the Gibbs phenomenon2, and the relatively poor sideband attenuation for filters designed with this method. An example of how the frequency response is affected is given in the next section. Figure 4 shows the same filter coefficients multiplied by the values for a Hamming window. Notice that the coefficient values approach zero in a much smoother fashion. Windowing is an important concept, and it is useful in frequency domain processing as well, such as reducing spectral "leakage" in discrete Fourier transforms.
The window coefficients that are used with the examples in this article are from the formula for the generalized Hamming window:
.
(7)
When the constant a = 0.5, then the window is known as a Hanning window. This type of window is also seen in DSP literature as a Hann or von Hann window, named after its creator. If a = 0.54, then it is known as a Hamming window. As a compliment to the 7-coefficient lowpass filter example done previously, a 7-coefficient Hamming window sequence is generated with a hand calculator, with N = (7-1)/2 = 3. As such,
w3 = 0.08000 w2 = 0.31000 w1 = 0.77000 w0 = 1.00000.
Because the function is symmetric (try the calculation for n = -3), the remaining values are:
w-3 = 0.08000 w-2 = 0.31000 w-1 = 0.77000.
A new set of windowed coefficients is now created by multiplying the previously generated FIR coefficients with these windowing values on a pairwise basis like this:
c-3 x w-3 , c-2 x w-2 , ... , c3 x w3.
The FIR filter design C program generates four different windows. These are triangular, Hanning, Hamming, and Blackman, and they are individually selectable when running the program. If a "0" is entered for this field, then the default is no window, which is actually a rectangular function. This is by no means the complete list of window functions, and others such as the Kaiser window can be fairly easily added to the program. Also, it is certainly a trade-off as to what window function should be used for a particular filtering problem. Experimentation will help show you what parameters work best for a given level of interference, your own personal listening tastes, what particular DSP chip architecture the filter will be implemented on, etc.
Simulated Filter Performance
Figure 5 is a plot of the output data from the C program that simulates the performance of an FIR bandpass filter with 127 coefficients. In this case, the lower cut-off frequency is 0.2 (7200 x 0.2 = 1440 Hz.) and the upper cut-off frequency is 0.3 (7200 x 0.3 = 2160 Hz.). No windowing function is used. As can be seen from the plot, there is significant ringing in the passband due to the Gibbs phenomenon. Another problem with this filter design is that the sidelobes are not attenuated much relative to the passband. These problems can be helped significantly with the usage of windowing functions as was discussed previously.
Figure 6 is a plot of the same filter design, this time utilizing a Hanning window. One can easily see that the passband ringing is virtually gone, and the sidelobe attenuation is considerably better. The trade-off condition can also be seen, and that is that the passband edges are slightly less steep than for the non-windowed filter.
Figure 7 is also a plot of the original filter design, but a Hamming window is used in this case. Comparing this plot directly with Figure 6 shows that a fairly small change in the windowing function can have significant effects on the resulting filter frequency response. In this case, the first sidelobes outside of the passband are attenuated by about 10 dB more than with the usage of a Hanning window. Again a price is to be paid, and in this case the attenuation of the extended sidelobes is not nearly as good, approximately 65 dB compared with 120 dB. Passband roll-off at the edges is comparable between the two windowing functions.
The interesting thing about these plots is that the data is calculated with double precision floating point arithmetic in the C program. The move to single precision floating point implementation with the C30 will not make a significant difference, but using the C50 with 16-bit precision filter coefficients will definitely result in a deviance from the simulated frequency response performance. I have added a function to the C program that will calculate the deviated frequency response for 16-bit coefficients, and a plot of this for the Hanning window case is shown in Figure 8. Compare this result to Figure 6 and note the difference. Since the quantization noise of a 16-bit value is approximately 96 dB, it is not reasonable to expect filter performance of 120 dB, unless the programmer can tolerate running slower and using more memory by using a 32-bit double precision scheme. Figure 9 shows the deviation for a 16-bit version with a Hamming window. The differences between this plot and the corresponding Figure 7 are much smaller, as the filter specifications are not nearly as dramatic and they fit better into the available dynamic range.
This is part of the process that the DSP designer must do to optimize the two variables of performance and price. In general, the cost of the DSP chips will rise as the word widths increase, and the cost of analog-to-digital and digital-to-analog converters will follow the same trend. In a high volume product the cost differences between a 16-bit, a 24-bit, or even a floating point DSP chip may force a compromise in performance in the final product. This need not be the case for the amateur radio experimenter, however, as the cost of DSP starter kits is relatively low for all of these various architectures as is discussed in the next section.
DSP Hardware Platform
Now that we know how to generate coefficients for an FIR filter that satisfy particular requirements, we are ready to implement the design in a DSP chip. The two different chips that I use in this article are the Texas Instruments TMS320C30 and the TMS320C50. These two popular DSP chips represent well a floating point architecture (C30) and a 16-bit fixed point architecture (C50). Both chips are available on internal PC half-cards that contain fast single-cycle static RAM external to the DSP chip, as well as a PC interface, and an Analog Interface Circuit (AIC) that is a single chip containing a 14-bit analog-to-digital converter for the incoming signal, a 14-bit digital-to-analog converter for the processed outgoing signal, anti-aliasing filters, and a programmable sampling frequency. This particular configuration is known as a C30 or C50 Evaluation Module, or EVM for short.
The C50 is also available as part of a starter kit called the DSK. This board has the C50 DSP chip and the AIC, but is a stand-alone board that connects to a PC with a serial cable. Also, no external RAM is provided, but the internal RAM of the C50 should be more than enough for most amateur radio experiments. This board is available from Texas Instruments for $99 and includes a code assembler and loader. The software tools are slightly more limited than the full set available with the EVM boards (no linker is provided so absolute addressing is required), but again they will easily suit the filtering example given here, as well as much more complicated algorithms that you might develop on your own. It is meant to compete with starter kits available from other manufacturers such as Motorola for their 56002 and Analog Devices for their 2101. Just as I was writing this article, I came across an announcement from Texas Instruments that a single bus version of the C30, the C31, would also be available in a DSK starter kit for $99. Although all of these DSP boards are useful to perform the filtering tasks described here, some architectures may provide you with features either in hardware or software that you might find more appealing to work with. It is suggested that you check the various manufacturers data sheets on these boards and the individual DSP chips themselves before settling on a platform to experiment with yourself, as you might find features such as the 16-bit ADC/DAC on the Motorola board to be more beneficial to your algorithms.
Real-Time Performance
Before attempting to implement any algorithm on a DSP chip, a real-time performance analysis should be done. In this case, the analysis is very straightforward and easy to do. The first thing that is needed is to set the sampling frequency for the analog-to-digital converter within the AIC to a reasonable value. A quick look at the AIC datasheet shows that the minimum sampling rate that can be programmed in with the standard clock is 7.2 KHz. Since it is desirable that the same sampling clock be used for both SSB and CW, the maximum frequency of the incoming analog signal could be limited to 2.8 KHz. by the AIC analog filter and the Nyquist sampling theorem would be satisfied. To prevent aliasing, the analog signal must be sampled at a frequency at least twice that of the highest frequency contained with it. If we band-limit our incoming signal to 2.8 KHz., then 7.2 KHz./2.8 KHz. = 2.5, an acceptable value.
Since the sampling frequency is known, this provides us with the value of time that the DSP chip has in which to calculate the filtered output. In this case, it is 1/7200 = 139 ms. The processor clock speeds for the C30 and C50 EVM boards are 30 and 50 MHz. respectively. Both DSP chips have machine cycles operating at speeds of one-half of the clock, so the C30 runs at a speed of 15 million cycles per second. If the assumption is made that all instructions will execute in one cycle (not always the case) then we can calculate the number of real-time instructions available as 15,000,000/7200. This simple calculation shows that we will have 2083 processor cycles available for our algorithm to successfully complete within the available sampling time, before new data is ready. As will be shown with assembly language examples, this is quite a lot of "horsepower" and we will have plenty of time left over after implementing a sophisticated time domain FIR filter.
By extrapolation we can see that if a very high performance DSP (40 million instructions per second is not unreasonable today) is to be used, a fairly high sampling rate of 40 KHz. would leave 1000 instructions per sampling interval for algorithm processing. This is essentially what is done for low IF digital signal processing today.
Circular Buffering
Before going into specific assembly code examples, it is necessary to understand the concept of circular buffering. From looking at the equation for an FIR filter (Equation 1), it can be seen that M data samples need to be collected so that a sum-of-products operation can be performed with the coefficients, resulting in one output value. For a filter of length 7, 7 input data samples need to be stored. For a filter with 127 coefficients, 127 input data samples need to be stored, etc. In a standard microprocessor, a variable length buffer would have to be set-up in memory and a pointer structure would have to be maintained to keep track of where incoming data was placed to avoid having to shift and move the entire block of previously stored data values. A circular buffer is a DSP chip feature that is designed into the hardware to help with this potentially high overhead function.
Figure 10 shows how circular buffers are used for both the FIR coefficients and the incoming data values. The cn values are place in consecutive memory locations, with lower memory addresses being towards the left in this example. The arrows above the memory locations indicate the position of a memory pointer, with the pointer starting out the calculation from position 1. Likewise, the incoming data values are placed into separate consecutive memory locations, and in the first example xn is the most recently received data value from the analog-to-digital converter. The filter calculation is now performed by multiplying the c-3 value that the coefficient pointer is accessing, times the xn-6 data value that the data pointer is accessing, and placing the value in the accumulator. Now both circular buffers automatically update their respective pointers to the next memory locations indicated by the number 2. These data values are then multiplied together and the results added to the previous results and stored in the accumulator. This process is carried out until all the coefficients and stored data values have been accessed.
At this time, the circular buffers are each updated in a slightly different manner. The coefficient buffer remains constant because no new coefficients are going to be added or changed during the filtering operation. As such, the pointer is set-up to return to position 1 for the next calculation cycle. The data circular buffer is different, however, as a new incoming value must be stored from the ADC. At this point in time, the incoming value xn+1 must replace the oldest value in the buffer, xn-6. But since it is now the most recent data value, it is necessary that it be the last to be accessed from memory. To accommodate this, the starting position of the pointer is then advanced by one position as seen in the second data buffer line. The next filter calculation now starts with the pointers in position 1 of their respective buffers, so that the first multiply/accumulate starts with c-3 and xn-5. When the data buffer pointer reaches the highest position in memory, it then wraps back to the lowest position automatically to access the next value consecutively. This "wrapping around" is why the concept is referred to as a circular buffer. If one was to cut out the memory buffers from Figure 10 with scissors and tape the lowest memory position to the highest, then it would reinforce well the circular nature of what is actually happening.
Circular buffers greatly simplify the handling of both the filter coefficients and the incoming data values to be filtered. Not only do they help to feed the multiply/accumulate logic with data quickly, but they provide an efficient memory structure so that only 2M memory locations need to be dedicated to the storage of coefficients and data values.
C30 Assembly Code Example
Figure 11 gives an assembly language code listing for an FIR filter implementation on the C30 EVM. After reset, an initialization routine is run which leads to the routine labeled as LOOP. A polling loop is at the start of the routine whose purpose is to continually check to see if the Analog Interface Chip (AIC) is ready to accept another data word from the DSP. If not, the device is continually re-polled. Of course this function could also be done with an interrupt service routine, but I chose to show polling as it is easier to follow in the coded examples. If it is ready, an integer data value is read by the DSP and converted to floating point format for further mathematical processing. This floating point data value is then stored in the circular buffer that has been previously set up in internal memory.
Now that data has been converted into the correct format and placed into the circular buffer, we are ready to compute the filter output by implementing the sum-of-products operation with the stored filter coefficients. Again, hardware assist within the DSP chip is utilized to make this operation extremely fast and efficient. The RPTS instruction forces the C30 to execute the next instruction in a repeating loop with no performance penalty. This is advantageous to the developer in two ways. First of all the DSP will be executing only the multiply-add instruction in consecutive cycles, so speed will be maximized. Second, the constant that is called TAPS in the RPTS instruction can be defined at the beginning of the program and easily modified so that the filter length can be changed quickly without having to search through the entire listing. As your DSP programs grow in complexity, this becomes a major benefit in helping with code maintenance.
In analyzing the one instruction that is actually calculating the sum-of-products for the filter, it is apparent that quite a lot of things are happening during this single machine cycle. First of all, the presence of the || symbol indicates that a parallel operation is taking place in the DSP. In other words, an MPYF3 (a floating point multiplication with 3 operands) is taking place at the same time as an ADDF (a floating point addition). It should be noted that not all individual instructions can be combined into parallel instructions. Since the entire instruction word itself is limited to 32 bits, a smaller subset of these bits is available to be allocated to instruction decoding, register allocation, addressing mode, etc. Because of this, a limited number of instructions are allowed to be paralleled, so they are chosen by the chip designers to be those most useful in typical signal processing algorithms. The MPYF3 || ADDF instruction is a good example, as it implements a sum-of-products operation, the heart of an FIR filter calculation.
The first two operands of the MPYF3 portion of the instruction are used to get the data values from the input circular buffer and the coefficient values, both now in internal DSP RAM. The multiplier portion of the circuitry now multiplies these two values together, and places the floating point result into the 3rd operand, R6. At the same time (in parallel), an adder in the ALU adds the floating point values in R6 and R5 with the resulting value placed into, or "accumulated", into R5. Both input operands for the multiply instruction are retrieved by circular address generators. The input data circular buffer is advanced one extra position after every run through the entire routine to make room for an incoming data word, while the coefficient circular buffer is reset back to the starting point of the coefficient table located in RAM. See Figure 10.
It is important to understand that because the instructions are operating in parallel, the ADDF portion of the instruction is operating on data that is one cycle behind in time! The architecture of the C30 is such that R6 is updated by the MPYF3 portion after the current value in R6 is read by the ADDF portion. Because of this, after the desired number of sum-of-products operations is performed, a final ADDF instruction must be tacked on to the end. After this instruction, R5 will contain the correct answer for the filter calculation.
After the final addition is completed, the filter output data must be converted from floating point back to integer form for output to the AIC Digital-to-Analog Converter (DAC). This conversion is accomplished in one cycle with the FIX instruction. Finally, the lower 2 bits of the 16-bit data word must be set to zero as the AIC accepts as data only the upper 14 bits, while the lower two bits are used during the initial programming and set-up phase.
Now that the calculation is complete and the filtered data word has been output from the DSP to the outside world, the routine executes a branch back to begin polling for the availability of more input data. The branch execution is enabled by the BD instruction, whose mnemonic is short for Branch Delayed. Under normal branching operation there is a three instruction delay (two instruction delay in the C50) in performing the actual branch after the instruction is read. If this branch were the last instruction in the routine, there would effectively be a waste of three instruction cycles because of internal processor pipeline delays. The delayed branch helps the programmer utilize this time by potentially allowing one to fill these gaps with useful instructions. In the routine shown, the conversion to integer and the mask and store operations are performed just prior to when the branch back to LOOP is actually implemented. The only caveat to using delayed branches is that during a conditional test the programmer must be careful, because the proceeding three instructions will be executed regardless of whether or not the branch is actually taken. This technique is utilized during the initial polling of the transmit bit at the beginning of the program.
The second portion of the assembly language program is labeled as COEFF, and it contains the actual filter coefficients as generated from the C program. For convenience, the C program prints the floating point coefficients into a separate file along with the .word assembler directive. The .asect directive tells the assembler that the data are to be assembled into the program at absolute address 0100h, chosen somewhat arbitrarily. To change filters in this basic set-up, generate a coefficient file for the filter parameters that are desired, then include the data values into the assembly language file at the COEFF entry point. If the number of coefficients has changed, modify the value of TAPS appropriately and then reassemble the code.
C50 Assembly Code Example
Figure 12 shows an assembly code listing for the C50 EVM that performs the same filtering function as the C30 code analyzed above. Without studying the code in detail, it should be apparent that there are some interesting differences between the two. First, the instruction mnemonics are not the same even though the two DSP chips are designed and produced by the same manufacturer. Second, there are no parallel instructions in the C50 architecture. And third, usage of indirect addressing pointers in the C50 assembly language is not as straightforward as in the C30.
Again, a simple polling scheme is implemented as the starting point of the FIR filtering routine. When the check is successful and the routine has been entered, data is input into the DSP and placed in the lower 16 bits of the accumulator. Note the difference in architecture here. The C50 has an "accumulator" register dedicated to that purpose alone, while the C30 could use any general purpose register with the addition function to effectively implement an accumulator. The SACL instruction then takes this data and stores it into RAM in a pre-defined circular buffer. Again, an architectural difference between the DSP chips is apparent in that the incoming data in the C50 must be routed through the accumulator before being placed in the RAM circular buffer. The C30 on the other hand can use a parallel instruction along with any general purpose register to increase efficiency.
Like the C30, the C50 has a hardware assisted "repeat next instruction" feature that is utilized to maintain single cycle execution of the time critical sum-of-products calculation, this time accomplished with a MAC instruction. The MAC COEFF, *+ instruction is short for Multiply/Accumulate, which allows the output of the multiplier to be routed directly into the accumulator in the same machine cycle. The two inputs are defined in the instruction to be COEFF, which is my label for the start of the filter coefficient data table in RAM, and the data circular buffer that is pointed to by the value contained in register AR6. This instruction is specially designed for repeating, as once the starting points in memory for the two operands are loaded, they are automatically incremented to the next value by special purpose addressing hardware. The plus (+) sign indicates that the starting value of the circular buffer is advanced one position for the next run through the routine. As was the case in the C30, the multiply/accumulate operation is registered (pipelined) so that the value in the accumulator is actually one cycle behind what the multiplier is presently calculating. As such, a final accumulator cycle needs to be performed. The C50 has a special instruction to do this called APAC, which automatically uses the correct operands for the update.
In the C30 filter coding example, the fact that floating point arithmetic is used in the sum-of-products calculation means that the answer appeared in a DSP register, then all that was necessary was to use a conversion instruction (FIX) before storing the new integer result in the AIC DAC. In the C50, however, things are not quite as simple due to the fact that the architecture is based on 16-bit fixed point words. Notice that in this example, no data conversion is necessary during data input because we are reading a twos complement integer from the AIC Analog-to-Digital Converter (ADC). Likewise, no conversion will be necessary at the output to the AIC DAC once the lower 2 bits are masked to zero. The price that we pay is that we must keep track of the decimal point in the accumulator. This is fairly simple in this application, but more sophisticated algorithms can result in this function becoming very complicated!
The fixed-point multiplier in the C50 multiplies 2 16-bit words together, producing a 32-bit result. If the 2 input words are unsigned, then the output is also unsigned. In this example, the input data from the AIC ADC are 16-bit twos complement words (14-bit resolution with the lower two bits set to zero) and the stored FIR filter coefficients are 16-bit twos complement words as well. Because of the way that binary numbers are multiplied together, the output of a multiplication of 2 twos complement numbers is another twos complement number, but with an additional sign bit on the left. The accumulator in the C50 can be set up to automatically compensate for this extra sign bit by shifting the word left one bit. Again, hardware assist comes in handy because without this an extra instruction would be necessary to shift the accumulator value by one bit.
After the sum-of-products calculation is compete, the answer is contained in the accumulator as a twos complement 32-bit number. Since we are only interested in transferring 16 bits to the AIC, the obvious thing to do is to write the upper half of the accumulator out since it is this half that contains the most significant bits of data, as well as the sign bit. The C50 provides instructions to Store Accumulator High word (SACH used in this code) and Store Accumulator Low word (SACL). Since the lower accumulator represents "extended resolution" bits in this filtering example, they can effectively be ignored. As with the C30 example, the lower 2 bits of the output word are masked to zero for compatibility with the AIC data format. A delayed branch (2 instruction delay) then returns the program back to the initial polling loop.
The second portion of the assembly routine is similar to the one that is used for the C30 code, except that the data is not in floating point format, but 16-bit hex format instead. This conversion is accomplished with a function in the C program, and rounding is provided for greater accuracy in the results. Again, changing these values along with the value of TAPS (if required) will change the filter performance after re-assembly.
Comparison of DSP Functionality
Several things are apparent when comparing the two assembly language routines on the C30 and the C50. Special purpose hardware is designed into both DSP chips to help make typical signal processing operations more efficient like circular buffering, sum-of-product calculations that can be performed in consecutive cycles by utilizing repeated multiply-addition type operations, and delayed branch instructions that can be used to efficiently handle both non-conditional and conditional program branches. Because the examples given in this paper were somewhat limited in scope, other DSP-specific hardware features were not touched. One major feature is the automatic bit-reversal addressing capability, available on both the C30 and the C50. This feature helps reduce the programming overhead associated with the coding of Fast Fourier Transform (FFT) algorithms, and could be quite useful for real-time filtering in the frequency domain as opposed to the time domain demonstrated here.
It is also fairly obvious that these DSP chips, even though they are designed and built by the same manufacturer, handle solving the same problems in different ways. It was shown in the two coding examples that floating point arithmetic alleviates the need for worrying about accumulator overflow or scaling, but necessitates a conversion to and from integer form for I/O purposes. Parallel instructions available on the C30 can help with overall efficiency and speed, but at the expense of forcing the programmer to think in terms of pipelining, and not all algorithms may be able to take advantage of this fully. Because the C50 is dealing with an instruction word that is only 16 bits wide compared to 32 bits for the C30, it is more limited in its usage of opcodes for features such as indirect addressing. If the programmer tends to use indirect addressing frequently (like I do) then C50 code will have extra instructions that are used solely to change the current address pointer. This is done in the C50 example twice, with MAR instructions.
Because of the fact that the assembly language instructions are different between DSP chips, and that the programs can become extremely difficult to design and maintain when they incorporate many functions in a real-time environment, many DSP developers will choose to develop their code in C language. Most (if not all) commercially available DSP chips now have C compilers available that will generate assembly language code from the higher level language of C. C provides certain advantages over assembly code, such as portability, speed of development, and maintainability, but inherently some inefficiencies will result in the compiled code. Because of this, developers will typically code their algorithms in C and then test the compiled code to make sure that real-time performance is met. If it is not, then the various module or modules can be identified and re-written in optimized assembly language, and then called via the main C program. As a matter of fact, C language code development can be seen to be influencing the DSP chip architecture itself. The C30 was designed from the start to be efficiently coded with C, and this can be seen in its larger, more general purpose register file among other things.
Final System Considerations
A potential pitfall of practical implementations of sampled systems such as the one here is when a digital-to-analog converter is used to convert the DSP algorithm output back to an analog signal. Practical DACs implement what is called a zero-order sample-and-hold on the output signal, which essentially clamps the output voltage at a particular analog value for the entire clocking period. An analog smoothing filter is then used to eliminate this "stairstep" function. Because of the sample-and-hold, the traditional impulse sampling function that is referred to in discrete time signal processing textbooks as "ideal sampling" is convolved with a rectangular pulse, equal in duration to the time between clock pulses. The outcome of this convolution process can be visualized as a multiplication of the frequency spectrum with the Fourier transform of a rectangle, a sin(x)/x function. This function is plotted in Figure 13, and note that for a standard baseband DAC that the frequency response will be down to approximately 0.6 at the Nyquist frequency of 0.5! This is quite a serious degradation of the system frequency response that we so carefully tried to design our filter to in the first place!
There are basically three solutions to this problem. The first is to use what is called a sin(x)/x compensation filter, which essentially is an analog peaking filter at the output, designed such that the overall output is not attenuated. This is the technique that is used in the AIC provided on the two EVM boards that I have used. In this case, the programmer need only worry about initializing the AIC correctly so that the compensation filter is used. The second solution can be used if the available DAC does not have the above mentioned built-in compensation filter, and that is the ability to program the compensation into the DSP code itself. Texas Instruments has detailed a method for designing a first-order digital filter that can be programmed in about 7 instructions per sampling interval, for very little overhead6. The third solution is to use an oversampling DAC. A typical oversampling rate for commercial DACs is 8 times, and a plot of how the sin(x)/x function for this case compares to the baseband case is shown in Figure 13. In this case, very little distortion of the frequency response is introduced.
Summary
This article has shown a complete algorithm-to-assembly code design example for developing real-time FIR filters on DSP chips, and the basic principles demonstrated can be extended to many different algorithms and applications. Because of the widespread availability of personal computers and reasonably priced software packages such as C compilers, it is fairly easy to develop an algorithm simulation environment to test performance prior to implementation. Even if it is not desired to experiment at an algorithmic level, one can go right to the implementation phase by utilizing design software available with textbooks and from other commercial sources. To go along with the personal computer, several DSP chip manufacturers are now offering low priced DSP starter kits that give the home experimenter the ability to generate and down-load assembly language programs from the PC for around $100. Because these boards generally have analog I/O interfaces included on them, they are a natural fit for amateur radio experimentation.
FIR filters implemented on DSP chips can provide dramatic results in amateur radio applications, and these algorithms can be the starting point for the design and implementation of more sophisticated functions such as IIR filters and adaptive noise reduction. As prices go down and processing power goes up, DSP chips are starting to work their way up in the radio architecture from predominantly audio processing, to performing IF processing functions such as demodulation. What better time could there be to start your own DSP experiments?
References
1Digital Filter Design, T.W. Parks and C.S. Burrus, John Wiley & Sons, 1987, ISBN 0-471-82896-3.
2Handbook for Digital Signal Processing, Editors Sanjit K. Mitra and James F. Kaiser, John Wiley & Sons, 1993, ISBN 0-471-61995-7.
3Digital Signal Processing - A Laboratory Approach Using PC-DSP, Oktay Alkin, Prentice Hall, 1994, ISBN 0-13-328139-6.
4Analog and Digital Filter Design Using C, Les Thede, Prentice Hall, 1996. ISBN 0-13-352627-5.
5C Algorithms for Real-Time DSP, Paul M. Embree, Prentice Hall, 1995, ISBN 0-13-337353-3.
6Linear Circuits Data Book - Volume 2, TLC32046 Wide-Band Analog Interface Circuit, Texas Instruments, 1992.