Perform Fixed-Point Arithmetic

This example shows how to perform basic fixed-point arithmetic operations.

Save the current state of all warnings before beginning.

warnstate = warning;

Addition and Subtraction

When you add two unsigned fixed-point numbers, you may need a carry bit to correctly represent the result. For this reason, when you add two B-bit numbers with the same scaling, the result has one more bit than the operands.

a = fi(0.234375,0,4,6);
c = a + a

c = 
    0.4688

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Unsigned
            WordLength: 5
        FractionLength: 6

a.bin

ans = 
'1111'

c.bin

ans = 
'11110'

With signed, two's-complement numbers, a similar scenario occurs because of the sign extension required to correctly represent the result.

a = fi(0.078125,1,4,6);
b = fi(-0.125,1,4,6);
c = a + b

c = 
   -0.0469

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 5
        FractionLength: 6

a.bin

ans = 
'0101'

b.bin

ans = 
'1000'

c.bin

ans = 
'11101'

If you add or subtract two numbers with different precision, the radix points must first be aligned to perform the operation. As a result, the result of the operation can be more than one bit wider than the operands, depending on how far apart the radix points are.

a = fi(pi,1,16,13);
b = fi(0.1,1,12,14);
c = a + b

c = 
    3.2416

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 18
        FractionLength: 14
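
The word length and fraction length reported above can be reproduced from the operand types. The following is a minimal sketch, assuming the default full-precision fimath settings; the variable names ia, ib, fl, and wl are introduced here for illustration only.

```matlab
a = fi(pi,1,16,13);
b = fi(0.1,1,12,14);

% Integer bits of each operand: word length minus fraction length
ia = a.WordLength - a.FractionLength;          % 16 - 13 = 3
ib = b.WordLength - b.FractionLength;          % 12 - 14 = -2

% Aligning the radix points keeps the larger fraction length and the
% larger integer part; one more bit holds a possible carry.
fl = max(a.FractionLength,b.FractionLength);   % 14
wl = max(ia,ib) + fl + 1;                      % 3 + 14 + 1 = 18

c = a + b;
isequal([c.WordLength c.FractionLength],[wl fl])
```

The check returns true for these operands, matching the 18-bit word length and 14-bit fraction length shown in the output.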

Further Considerations for Addition and Subtraction

The following pattern is not recommended. Because scalar additions are performed at each iteration in the for loop, a bit is added to temp during each iteration. As a result, instead of a bit growth of ceil(log2(Nadds)), the bit growth is equal to Nadds.

s = rng; rng('default');
b = fi(4*rand(16,1)-2,1,32,30);
rng(s); % restore RNG state
Nadds = length(b) - 1;
temp = b(1);
for n = 1:Nadds
    temp = temp + b(n+1); % temp has 15 more bits than b
end

If the sum command is used instead, the bit growth is curbed.

c = sum(b) % c has 4 more bits than b

c = 
    7.0059

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 36
        FractionLength: 30
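
To compare the two approaches directly, you can measure the bit growth of each. This sketch assumes the default full-precision fimath settings and uses an all-zero vector, because only the data type matters here; growthLoop and growthSum are illustrative names.

```matlab
b = fi(zeros(16,1),1,32,30);   % values do not matter, only the data type

% Scalar additions in a loop: each addition grows temp by one bit
temp = b(1);
for n = 1:length(b)-1
    temp = temp + b(n+1);
end
growthLoop = temp.WordLength - b.WordLength;   % 15

% sum grows the word length only as much as the number of terms requires
c = sum(b);
growthSum = c.WordLength - b.WordLength;       % 4
```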

Multiplication

In general, a full-precision product requires a word length equal to the sum of the word lengths of the operands. In this example, the word length of the product c is equal to the word length of a plus the word length of b. The fraction length of c is also equal to the fraction length of a plus the fraction length of b.

a = fi(pi,1,20);
b = fi(exp(1),1,16);
c = a*b

c = 
    8.5397

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 36
        FractionLength: 30
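
One way to see why the fraction lengths add is to look at the stored integers. The following is a small sketch, assuming the default full-precision fimath settings; ia, ib, ic, sameInteger, and sameFraction are illustrative names.

```matlab
a = fi(pi,1,20);
b = fi(exp(1),1,16);
c = a*b;

% Each fi value equals its stored integer times 2^(-FractionLength)
ia = double(storedInteger(a));
ib = double(storedInteger(b));
ic = double(storedInteger(c));

% Multiplying the stored integers gives the stored integer of the product,
% and the scale factors multiply, so the fraction lengths add.
sameInteger  = (ic == ia*ib);
sameFraction = (c.FractionLength == a.FractionLength + b.FractionLength);
```

Both checks are expected to be true because a full-precision product discards no bits.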

Assignment

When you assign a fixed-point value into a predefined variable, quantization might be involved. In such cases, the right-hand side of the expression is quantized by rounding to nearest and then saturating, if necessary, before assigning to the left-hand side.

N = 10;

a = fi(2*rand(N,1)-1,1,16,15);
b = fi(2*rand(N,1)-1,1,16,15);
c = fi(zeros(N,1),1,16,14);

for n = 1:N
    c(n) = a(n).*b(n);
end

When the product a(n).*b(n) is computed with full precision, an intermediate result with a word length of 32 bits and a fraction length of 30 bits is generated. That result is then quantized to a word length of 16 bits and a fraction length of 14 bits. The quantized value is then assigned to the element c(n).
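
The two steps of quantization at assignment, rounding to nearest and then saturating, can be seen in isolation on single values. This is a minimal sketch, assuming the default fimath settings have not been changed; the input values are chosen only for illustration.

```matlab
c = fi(zeros(2,1),1,16,14);   % range [-2, 2), resolution 2^-14

% 0.75 of one LSB of c: rounds up to the nearest representable value
c(1) = fi(3*2^-16,1,18,16);

% 3 lies outside the range of c: saturates at the largest value
c(2) = fi(3,1,16,12);

double(c(1)) == 2^-14       % rounded to nearest
double(c(2)) == 2 - 2^-14   % saturated
```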

Quantize Results Explicitly

Often, it is not desirable to round to nearest or to saturate when quantizing a result because of the extra logic and computation required. It may also be undesirable to have to assign to a left-hand side value to perform the quantization. You can use the quantize function for such purposes. A common case is a feedback loop. If no quantization is introduced, unbounded bit growth occurs as more input data is provided.

a = fi(0.1,1,16,18);
x = fi(2*rand(128,1)-1,1,16,15);
y = fi(zeros(size(x)),1,16,14);

for n = 1:length(x)
    z    = y(n);
    y(n) = x(n) - quantize(a.*z,true,16,14,'Floor','Wrap');
end

In this example, the product a.*z is computed with full precision and is then quantized to a word length of 16 bits and a fraction length of 14 bits. The quantization is done by rounding to floor (truncation) and allowing for wrapping if overflow occurs. Quantization still occurs at assignment because the expression x(n) - quantize(a.*z,...) produces an intermediate result of 18 bits and y is defined to have 16 bits.
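
To see what the 'Floor' and 'Wrap' options do to a value that does not fit in the target type, compare them with 'Nearest' and 'Saturate' on a single number. This is an illustrative sketch; the value 2.5 and the names v, w, and s are chosen here for the comparison only.

```matlab
v = fi(2.5,1,16,12);            % 2.5 does not fit in [-2, 2)
T = numerictype(true,16,14);

% Floor and Wrap: the high-order bits are dropped, so 2.5 wraps to 2.5 - 4
w = quantize(v,T,'Floor','Wrap');

% Nearest and Saturate, for comparison: clips to the largest value
s = quantize(v,T,'Nearest','Saturate');

double(w)   % -1.5
double(s)   % 2 - 2^-14
```

Wrapping needs no comparison logic, which is why it is cheaper in hardware, but the wrapped value can be far from the true one.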

To eliminate the quantization at assignment, you can introduce an additional explicit quantization so that no round-to-nearest or saturation logic is used. Because the left-hand side result has the same 16-bit word length and 14-bit fraction length as y(n), no quantization is necessary.

a = fi(0.1,1,16,18);
x = fi(2*rand(128,1)-1,1,16,15);
y = fi(zeros(size(x)),1,16,14);
T = numerictype(true,16,14);

for n = 1:length(x)
    z    = y(n);
    y(n) = quantize(x(n),T,'Floor','Wrap') - ...
           quantize(a.*z,T,'Floor','Wrap');
end

Non-Full-Precision Sums

Full-precision sums are not always desirable and can result in complicated and inefficient generated C code. For example, the intermediate result x(n) - quantize(...) in the above example has an 18-bit word length. Instead, it may be desirable to keep all results of addition and subtraction to 16 bits. You can use the accumpos and accumneg functions to keep the results of addition and subtraction to 16 bits.

a = fi(0.1,1,16,18);
x = fi(2*rand(128,1)-1,1,16,15);
y = fi(zeros(size(x)),1,16,14);

T = numerictype(true, 16, 14);

for n = 1:length(x)
    z    = y(n);
    y(n) = quantize(x(n),T);                 % defaults: 'Floor','Wrap'
    y(n) = accumneg(y(n),quantize(a.*z, T)); % defaults: 'Floor','Wrap'
end
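
The difference between a full-precision subtraction and accumneg shows up in the word length of the result. The following is a small sketch, assuming the default full-precision fimath settings; acc, d, wide, and kept are illustrative names.

```matlab
acc = fi(0.5,1,16,14);
d   = fi(0.25,1,16,15);

wide = acc - d;             % full precision: the word length grows
kept = accumneg(acc,d);     % result keeps the 16-bit type of acc

wide.WordLength             % 18
kept.WordLength             % 16
double(kept)                % 0.25
```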

Model Accumulators

You can use the accumpos and accumneg functions to model accumulators. The behavior of accumpos and accumneg corresponds to the += and -= operators in C, respectively. A common example is an FIR filter in which the coefficients and input data are represented with 16 bits. The multiplication is performed in full precision, yielding 32 bits, and the accumulator adds 8 guard bits, for a total of 40 bits. This allows up to 256 accumulations without the possibility of overflow.

b = fi(1/256*[1:128,128:-1:1],1,16); % Filter coefficients
x = fi(2*rand(300,1)-1,1,16,15);     % Input data
z = fi(zeros(256,1),1,16,15);        % Used to store the states
y = fi(zeros(size(x)),1,40,31);      % Initialize Output data

for n = 1:length(x)
    acc = fi(0,1,40,31); % Reset accumulator
    z(1) = x(n);        % Load input sample
    for k = 1:length(b)
        acc = accumpos(acc,b(k).*z(k)); % Multiply and accumulate
    end
    z(2:end) = z(1:end-1); % Update states
    y(n) = acc;            % Assign output
end
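
The accumulator width used above follows from the product width and the number of accumulations. This is a back-of-the-envelope sketch in plain MATLAB; the variable names are illustrative.

```matlab
productWL = 16 + 16;                 % full-precision 16-by-16-bit product
numAccum  = 256;                     % number of accumulations to support
guardBits = ceil(log2(numAccum));    % 8
accWL     = productWL + guardBits;   % 40
```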

Matrix Arithmetic

To simplify syntax and shorten simulation time, you can use matrix arithmetic. For the FIR filter example, you can replace the inner loop with an inner product.

z = fi(zeros(256,1),1,16,15); % Used to store the states
y = fi(zeros(size(x)),1,40,31);

for n = 1:length(x)
    z(1) = x(n);
    y(n) = b*z;
    z(2:end) = z(1:end-1);
end

The inner product b*z is performed with full precision. Because this is a matrix operation, the bit growth is due to both the multiplication involved and the addition of the resulting products. Therefore, the bit growth depends on the length of the operands. In this example, b and z have length 256, resulting in an 8-bit growth due to the additions. The inner product results in 32 + 8 = 40 bits, with a fraction length of 31 bits. No quantization occurs in the assignment y(n) = b*z because y was initialized to this format.

If you had to perform an inner product for more than 256 coefficients, the bit growth would be more than 8 bits beyond the 32 needed for the product. If you only had a 40-bit accumulator, you could model the behavior either by introducing a quantizer Q, as in y(n) = quantize(Q,b*z), or by using the accumpos function.

Model a Counter

You can use accumpos to model a simple counter that wraps after reaching its maximum value. For example, you can model a 3-bit counter as follows.

c = fi(0,0,3,0);
Ncounts = 20; % Number of times to count
for n = 1:Ncounts
    c = accumpos(c,1);
end

Because the 3-bit counter wraps back to 0 after reaching 7, the final value of the counter is mod(20,8) = 4.
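
To watch the counter wrap, you can record its value after each increment. This sketch extends the loop above with a plain double array; vals is an illustrative name.

```matlab
c = fi(0,0,3,0);
Ncounts = 20;
vals = zeros(1,Ncounts);    % counter value after each increment
for n = 1:Ncounts
    c = accumpos(c,1);
    vals(n) = double(c);
end
isequal(vals,mod(1:Ncounts,8))
```

The recorded sequence runs 1 through 7, returns to 0, and ends at 4 after the 20th count.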

Arithmetic with Other Built-In Data Types

In the C language, the result of an operation between an integer data type and a double data type is promoted to a double. However, in MATLAB®, the result of an operation between a built-in integer data type and a double data type is an integer. In this respect, the fi object behaves like the built-in integer data types in MATLAB. The result of an operation between a fi and a double is a fi.
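
The following short sketch illustrates both behaviors side by side. The values are chosen only for illustration.

```matlab
k = int8(10) * 0.26;   % computed as 2.6, then rounded to the nearest integer
class(k)               % 'int8'
k                      % 3

f = fi(pi) * 0.5;      % mixing fi and double yields a fi
isfi(f)                % true
```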

`fi * double`

When doing multiplication between fi and double, the double is cast to a fi with the same word length and signedness as the fi, and best-precision fraction length. The result of the operation is a fi.

a = fi(pi);
b = 0.5 * a

b = 
    1.5708

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 32
        FractionLength: 28

`fi + double` or `fi - double`

When doing addition or subtraction between fi and double, the double is cast to a fi with the same numerictype as the fi. The result of the operation is a fi.

a = fi(pi);
b = a + 1

b = 
    4.1416

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 17
        FractionLength: 13

`fi * int8`

When doing arithmetic between fi and one of the built-in integer data types, [u]int[8,16,32,64], the word length and signedness of the integer are preserved. The result of the operation is a fi.

a = fi(pi);
b = int8(2) * a

b = 
    6.2832

          DataTypeMode: Fixed-point: binary point scaling
            Signedness: Signed
            WordLength: 24
        FractionLength: 13

Restore warning states.

warning(warnstate);
%#ok<*NASGU,*NOPTS>