Misuse of sign-extended character value

Data type conversion with sign extension causes unexpected behavior

Description

This defect occurs when you convert a signed or plain char variable containing possible negative values to a wider integer data type (or perform an arithmetic operation that does the conversion) and then use the resulting value in one of these ways:

  • For comparison with EOF (using == or !=)

  • As an array index

  • As an argument to a character-handling function in ctype.h, for instance, isalpha() or isdigit()

If you convert a signed char variable with a negative value to a wider type such as int, the sign bit is preserved (sign extension). This can lead to specific problems even in situations where you think you have accounted for the sign bit.

For instance, the signed char value of -1 can represent the character EOF (end-of-file), which is an invalid character. Suppose a char variable var acquires this value. If you treat var as a char variable, you might want to write special code to account for this invalid character value. However, if you perform an operation such as var++ (involving integer promotion), it leads to the value 0, which represents a valid value '\0' by accident. You transitioned from an invalid to a valid value through the arithmetic operation.
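
The transition can be reproduced in a few lines. The following is a minimal sketch, not part of the checker: the helper name next_value is hypothetical, and signed char is used explicitly so that the behavior does not depend on whether plain char is signed on your platform.

```c
/* Hypothetical helper: increments a character value the way var++ would.
   The operand is promoted to int, incremented, and converted back. */
static signed char next_value(signed char var)
{
    var++;          /* -1 (possibly standing for EOF) becomes 0 */
    return var;
}
```

Calling next_value(-1) returns 0, which compares equal to '\0', so a later check for the invalid value -1 no longer fires.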

Even for negative values other than -1, a conversion from signed char to signed int can lead to other issues. For instance, the signed char value -126 is equivalent to the unsigned char value 130 (corresponding to an extended character '\202'). If you convert the value from char to int, the sign bit is preserved. If you then cast the resulting value to unsigned int, you get an unexpectedly large value, 4294967170 (assuming 32-bit int). If your code expects the unsigned char value of 130 in the final unsigned int variable, you can see unexpected results.
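
The two conversion paths can be compared side by side. This is a sketch under the assumptions stated in the text (8-bit char, 32-bit int); the function names are hypothetical.

```c
/* Hypothetical helpers illustrating the two conversion paths. */

/* Path 1: signed char -> int (sign extension) -> unsigned int. */
static unsigned int widen_sign_extended(signed char c)
{
    int i = c;                 /* -126 stays -126 after sign extension */
    return (unsigned int)i;    /* wraps modulo UINT_MAX + 1 */
}

/* Path 2: signed char -> unsigned char -> unsigned int. */
static unsigned int widen_via_unsigned_char(signed char c)
{
    return (unsigned int)(unsigned char)c;   /* -126 becomes 130 */
}
```

With a 32-bit int, widen_sign_extended(-126) yields 4294967170, while widen_via_unsigned_char(-126) yields 130.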

The underlying cause of this issue is the sign extension during conversion to a wider type. Most architectures use two's complement representation for storing values. In this representation, the most significant bit indicates the sign of the value. When a value is converted to a wider signed type, this sign bit is copied to all the leading bits of the wider type, so that the sign is preserved. For instance, the char value of -3 is represented as 11111101 (assuming 8-bit char). When converted to int (assuming 32-bit int), the representation is:

11111111 11111111 11111111 11111101

The value -3 is preserved in the wider type int. However, when converted to unsigned int, the value (4294967293) is no longer the same as the unsigned char equivalent of the original char value (253). If you are not aware of this issue, you can see unexpected results in your code.
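
To inspect the bit patterns yourself, a small helper along these lines might be useful. It is only an illustration; the name to_binary and its buffer contract are assumptions made for this sketch.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: writes the bits of v, most significant first, into out.
   out must hold at least sizeof(unsigned int) * CHAR_BIT + 1 characters. */
static void to_binary(unsigned int v, char *out)
{
    size_t nbits = sizeof v * CHAR_BIT;
    for (size_t i = 0; i < nbits; i++) {
        out[i] = ((v >> (nbits - 1 - i)) & 1u) ? '1' : '0';
    }
    out[nbits] = '\0';
}
```

For a signed char c holding -3, to_binary((unsigned int)(int)c, buf) fills buf with leading ones followed by 11111101, whereas to_binary((unsigned char)c, buf) fills it with leading zeros followed by 11111101.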

Risk

In the following cases, Bug Finder flags use of variables after a conversion from char to a wider data type or an arithmetic operation that implicitly converts the variable to a wider data type:

  • If you compare the variable value with EOF:

    A char value of -1 can represent the invalid character EOF or the valid extended character value '\377' (corresponding to the unsigned char equivalent, 255). After a char variable is cast to a wider type such as int, sign extension turns the char value -1, which could mean either EOF or '\377', into the int value -1, which means only EOF. The unsigned char value 255 can no longer be recovered from the int variable. Bug Finder flags this situation so that you can cast the variable to unsigned char first (or avoid the char-to-int conversion or converting operation before the comparison with EOF). Only then is a comparison with EOF meaningful. See Sign-Extended Character Value Compared with EOF.

  • If you use the variable value as an array index:

    After a char variable is cast to a wider type such as int, because of sign extension, all negative values retain their sign. If you use the negative values directly to access an array, you can cause a buffer overflow or underflow. Even when you account for the negative values, the way you account for them might result in incorrect elements being read from the array. See Sign-Extended Character Value Used as Array Index.

  • If you pass the variable value as argument to a character-handling function:

    According to the C11 standard (Section 7.4), if you supply an integer argument that cannot be represented as unsigned char or EOF, the resulting behavior is undefined. Bug Finder flags this situation because negative char values after conversion can no longer be represented as unsigned char or EOF. For instance, the signed char value -126 is equivalent to the unsigned char value 130, but the signed int value -126 cannot be represented as unsigned char or EOF.
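
For the ctype.h case, a common way to stay within the range that the standard allows is to cast at the call site. The wrapper below is a sketch; the name is_alpha_char is hypothetical.

```c
#include <ctype.h>

/* Hypothetical wrapper: classifies a plain char safely.
   The cast maps every char value into 0..UCHAR_MAX, a range isalpha() accepts. */
static int is_alpha_char(char c)
{
    return isalpha((unsigned char)c) != 0;
}
```

Passing c directly to isalpha() would hand it a negative int whenever plain char is signed and c holds an extended character such as '\202'.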

Fix

Before conversion to a wider integer data type, cast the signed or plain char value explicitly to unsigned char.
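
As one illustration of this fix outside the examples on this page, consider counting byte values in a string. The helper name count_bytes and its calling convention are assumptions made for this sketch.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: counts how often each byte value occurs in s.
   counts must have UCHAR_MAX + 1 elements, initialized by the caller. */
static void count_bytes(const char *s, size_t counts[UCHAR_MAX + 1])
{
    for (; *s != '\0'; s++) {
        unsigned char idx = (unsigned char)*s;  /* 0..UCHAR_MAX, never negative */
        counts[idx]++;
    }
}
```

Because the cast happens before the value is widened for indexing, an extended character such as '\202' lands in counts[130] (assuming 8-bit char) instead of indexing before the start of the array.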

If you use the char data type not to represent characters but simply as a smaller data type to save memory, your use of sign-extended char values might avoid the risks mentioned earlier. If so, add comments to your result or code to avoid another review.

Examples

Sign-Extended Character Value Compared with EOF

#include <stdio.h>
#include <stdlib.h>
#define fatal_error() abort()

extern char parsed_token_buffer[20];

static int parser(char *buf)
{
    int c = EOF;
    if (buf && *buf) {
        c = *buf++;    /* plain char is sign-extended to int here */
    }
    return c;
}

void func()
{
    if (parser(parsed_token_buffer) == EOF) { 
        /* Handle error */
        fatal_error();
    }
}

In this example, the function parser can traverse a string input buf. If a character in the string has the value -1, it can represent either EOF or the valid character value '\377' (corresponding to the unsigned char equivalent 255). When converted to the int variable c, its value becomes the integer value -1, which is always EOF. The later comparison with EOF therefore cannot tell whether parser returned a real EOF or the valid character '\377'.

Correction – Cast to unsigned char Before Conversion

One possible correction is to cast the plain char value to unsigned char before conversion to the wider int type. Only then can you test if the return value of parser is really EOF.

#include <stdio.h>
#include <stdlib.h>
#define fatal_error() abort()

extern char parsed_token_buffer[20];

static int parser(char *buf)
{
    int c = EOF;
    if (buf && *buf) {
        c = (unsigned char)*buf++;    
    }
    return c;
}

void func()
{
    if (parser(parsed_token_buffer) == EOF) { 
        /* Handle error */
        fatal_error();
    }
}
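
The difference between the flagged and the corrected pattern can be checked with a buffer whose first byte is '\377'. The sketch below uses hypothetical function names and signed char explicitly, so that the result does not depend on the signedness of plain char. It also assumes that EOF is -1, which is common but not guaranteed by the C standard.

```c
#include <stdio.h>

/* Hypothetical variants of the parser step shown above. */

/* Flagged pattern: the character is sign-extended to int. */
static int first_char_sign_extended(const signed char *buf)
{
    int c = EOF;
    if (buf && *buf) {
        c = *buf;                     /* '\377' is stored as -1 and stays -1 */
    }
    return c;
}

/* Corrected pattern: cast to unsigned char before widening. */
static int first_char_unsigned(const signed char *buf)
{
    int c = EOF;
    if (buf && *buf) {
        c = (unsigned char)*buf;      /* '\377' becomes 255 */
    }
    return c;
}
```

For a buffer that starts with the byte value -1, first_char_sign_extended returns -1, which a caller cannot tell apart from EOF under the assumption above, while first_char_unsigned returns 255.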

Sign-Extended Character Value Used as Array Index

#include <limits.h>
#include <stddef.h>
#include <stdio.h>

#define NUL '\0'
#define SOH 1    /* start of heading */
#define STX 2    /* start of text */
#define ETX 3    /* end of text */
#define EOT 4    /* end of transmission */
#define ENQ 5    /* enquiry */
#define ACK 6    /* acknowledge */

static const int ascii_table[UCHAR_MAX + 1] =
{
      [0]=NUL,[1]=SOH, [2]=STX, [3]=ETX, [4]=EOT, [5]=ENQ,[6]=ACK,
      /* ... */
      [126] = '~',
      /* ... */
      [130/*-126*/]='\202',
      /* ... */
      [255 /*-1*/]='\377'
};

int lookup_ascii_table(char c)
{
    int i;
    i = (c < 0 ? -c : c);    /* sign reversal: -126 maps to 126, not 130 */
    return ascii_table[i];
}

In this example, the char variable c is converted to the int variable i. If c has negative values, they are converted to positive values before assignment to i. However, this conversion can lead to unexpected values when i is used as array index. For instance:

  • If c has the value -1 representing the invalid character EOF, you probably want to treat this value separately. However, in this example, a value of c equal to -1 leads to a value of i equal to 1. The function lookup_ascii_table returns the value ascii_table[1] (or SOH) without the invalid character value EOF being accounted for.

    If you use the char data type not to represent characters but simply as a smaller data type to save memory, you need not worry about this issue.

  • If c has a negative value, when assigned to i, its sign is reversed. However, if you access the elements of ascii_table through i, this sign reversal can result in unexpected values being read.

    For instance, if c has the value -126, i has the value 126. The function lookup_ascii_table returns the value ascii_table[126] (or '~') but you probably expected the value ascii_table[130] (or '\202').
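
The index arithmetic described in the two points above can be isolated as follows. Both function names are hypothetical, and the values in the note after the block assume an 8-bit char.

```c
/* Hypothetical helpers that isolate the two ways of computing the index. */

/* Flagged pattern: sign reversal after promotion to int. */
static int index_by_negation(signed char c)
{
    int i = (c < 0 ? -c : c);
    return i;
}

/* Cast-based index: the unsigned char equivalent of c. */
static int index_by_cast(signed char c)
{
    return (unsigned char)c;
}
```

For c equal to -126, index_by_negation gives 126 and index_by_cast gives 130. For c equal to -1, the results are 1 and 255. For non-negative values the two functions agree.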

Correction – Cast to unsigned char

To correct the issues, avoid the conversion from char to int. First, check c for the value EOF. Then, cast the value of the char variable c to unsigned char and use the result as array index.

#include <limits.h>
#include <stddef.h>
#include <stdio.h>

#define NUL '\0'
#define SOH 1    /* start of heading */
#define STX 2    /* start of text */
#define ETX 3    /* end of text */
#define EOT 4    /* end of transmission */
#define ENQ 5    /* enquiry */
#define ACK 6    /* acknowledge */

static const int ascii_table[UCHAR_MAX + 1] =
{
      [0]=NUL,[1]=SOH, [2]=STX, [3]=ETX, [4]=EOT, [5]=ENQ,[6]=ACK,
      /* ... */
      [126] = '~',
      /* ... */
      [130/*-126*/]='\202',
      /* ... */
      [255 /*-1*/]='\377'
};

int lookup_ascii_table(char c)
{
    int r = EOF;
    if (c != EOF) /* handle EOF (invalid character) separately */
        r = ascii_table[(unsigned char)c]; /* cast to 'unsigned char' */
    return r;
}

Result Information

Group: Programming
Language: C | C++
Default: On for handwritten code, off for generated code
Command-Line Syntax: CHARACTER_MISUSE
Impact: Medium

Version History

Introduced in R2017a