A deep dive into the char type in C/C++

Posted on 2015-12-14

From the name of the type, one might think that a char variable always represents a character (an ASCII character, to be precise). However, char in C/C++ is not a "character type" but an integer type. The standard only guarantees that char is at least 8 bits wide, but in practice it is an 8-bit integer type on virtually every platform, from mobile phones to desktop computers.
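You can verify this on your own machine with a quick check like the sketch below: sizeof(char) is 1 by definition, and CHAR_BIT from limits.h gives the number of bits in a char (the exact output, of course, depends on your platform).

#include <limits.h>
#include <stdio.h>

int main() {
  // sizeof(char) is 1 by definition; CHAR_BIT is the number of bits in a
  // char and is at least 8 (exactly 8 on virtually all modern platforms).
  printf("sizeof(char) = %zu\n", sizeof(char));
  printf("CHAR_BIT = %d\n", CHAR_BIT);

  return 0;
}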

The C standard does not define whether char is a signed or an unsigned type. Even though the standard defines char, signed char, and unsigned char as three distinct types, it requires plain char to have the same range, representation, and behavior as either signed char or unsigned char; which of the two is implementation-defined (see section 6.2.5, item 15 of the C11 standard). Some compilers let you choose the interpretation that is best for your project. For instance, gcc accepts the -fsigned-char and -funsigned-char flags.
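If you are not sure which interpretation your compiler (or a given set of flags) uses, one way to find out is to inspect CHAR_MIN from limits.h, which is 0 when char is unsigned and negative when it is signed. A minimal sketch:

#include <limits.h>
#include <stdio.h>

int main() {
  // CHAR_MIN is 0 if plain char behaves like unsigned char, and SCHAR_MIN
  // (typically -128) if it behaves like signed char.
  if (CHAR_MIN < 0) {
    printf("char is signed on this platform\n");
  } else {
    printf("char is unsigned on this platform\n");
  }

  return 0;
}

Because CHAR_MIN is a macro, the same test also works at compile time, e.g. with #if CHAR_MIN < 0.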

This flexible signedness of the char type paves the way for a number of dangerous pitfalls. For example, the following loop terminates if char is unsigned, but runs forever if char is signed:

#include <stdio.h>

int main() {
  for (char c = 0; c < 200; ++c) {
    printf("%c\n", c);
  }

  return 0;
}

If char is an unsigned type, the program terminates because an unsigned 8-bit integer can store values ranging from 0 to 255, so c eventually reaches 200. A signed 8-bit integer, however, can only hold values between -128 and 127, which means the condition c < 200 always evaluates to true. To be precise, when the condition c < 200 is evaluated and char is signed, c is first promoted to int and then compared with the int value 200 (int has at least 16 bits, so it can represent 200). Even after this promotion, though, the value never exceeds 127: each time c reaches 127, the update ++c wraps it around to -128. (Strictly speaking, converting the out-of-range value 128 back to a signed char is implementation-defined, but on two's complement machines it wraps as described.)
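The wrap-around is easy to observe in isolation. In the sketch below, the exact results are implementation-defined, but on the two's complement machines you are likely to encounter, assigning 200 to a signed char stores 200 - 256 = -56, and incrementing 127 wraps to -128:

#include <stdio.h>

int main() {
  signed char c = 200;  // Out of range for signed char: typically stored as -56.
  printf("%d\n", c);

  signed char d = 127;
  ++d;                  // 128 does not fit: typically wraps to -128.
  printf("%d\n", d);

  return 0;
}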

Below is a classic example of how an incorrect assumption about the signedness of char can lead to undesired program behavior:

#include <stdio.h>

int main() {
  // DANGER: getchar() returns int, not char!
  char c = getchar();

  while (c != EOF) {
    printf("%c\n", c);
    c = getchar();
  }

  return 0;
}

If char is unsigned, this program will never terminate. EOF is defined in stdio.h as a negative int constant, typically -1, a value that an unsigned char cannot represent. Specifically, in the two's complement representation used by modern computers, -1 corresponds to 0xffffffff (assuming a 32-bit int; a similar argument holds for 16-bit systems). When this value is assigned to an unsigned char, it is truncated to 0xff, so when getchar returns EOF, c becomes 0xff. When c is later compared with EOF, it is promoted to int, yielding 0x000000ff, which is 255 in decimal. As a result, the condition c != EOF stays true even after getchar has reported an error or the end of input. The signed case is not entirely safe either: if char is signed, the loop does terminate on EOF, but it also terminates prematurely whenever the input contains a byte with value 0xff, because that byte, converted to a signed char, equals -1 and therefore compares equal to EOF. The corrected version of the program is provided below:

#include <stdio.h>

int main() {
  // CORRECT: getchar() returns int.
  int i = getchar();

  while (i != EOF) {
    char c = i;
    printf("%c\n", c);
    i = getchar();
  }

  return 0;
}
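The truncation described above is also easy to reproduce on its own. Assuming the usual definition of EOF as -1, the sketch below shows that assigning it to an unsigned char yields 255, which no longer compares equal to EOF after promotion:

#include <stdio.h>

int main() {
  unsigned char c = EOF;     // -1 truncated to 0xff, i.e., 255.
  printf("%d\n", c);         // Prints 255.
  printf("%d\n", c == EOF);  // Prints 0: 255 != -1 after promotion to int.

  return 0;
}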

Note how, in our very first sample code above, we used the increment operator ++ on a variable of type char. Generally, since char is an arithmetic type, all statements in the code below are valid:

int main() {
  char c;

  c = 35;
  c = '#';

  // Shortcut assignment operators.
  c *= 2;
  c /= 3;
  c %= 25;

  // Increment and decrement operators.
  ++c;
  --c;

  // Arithmetic expressions.
  c = 'a' + '#' - 20;
  c = 2 * 'c' + '#' / 7;

  return 0;
}

It's vital to remember that since char is an integer type, it can overflow, as demonstrated in our first example. To ensure consistent behavior across different compilers, consider using signed char or unsigned char instead of plain char. For arithmetic on 8-bit integers, it's advisable to use the int8_t and uint8_t types instead of the char type; these fixed-width integer types are declared in stdint.h.
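A small sketch of what that looks like, assuming a C99 compiler where stdint.h provides the fixed-width types:

#include <stdint.h>
#include <stdio.h>

int main() {
  // Explicitly signed and unsigned 8-bit integers: their signedness does
  // not depend on the compiler or its flags.
  int8_t i = -5;
  uint8_t u = 250;

  i -= 10;  // -15, well within the range [-128, 127].
  u += 10;  // 260 does not fit in 8 bits; the conversion back to uint8_t
            // wraps to 4, which is well-defined for unsigned types.

  printf("%d %d\n", i, u);  // Prints "-15 4".

  return 0;
}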

As illustrated in the arithmetic example further above, an expression like '#' represents the integer value of the ASCII code associated with that symbol. Given that the code for '#' is 35, the first two assignment statements in that program are equivalent: expressions such as '#' are replaced by their corresponding integer values during compilation. (In fact, in C a character constant such as '#' has type int, not char; in C++ it has type char.) Hence, a char type isn't intrinsically more "character-like" than any other type. It's merely the smallest integer type capable of representing any of the 128 characters of the ASCII standard.
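The sketch below makes this concrete (assuming an ASCII-based character set): a character constant can be printed as an integer, an integer can be printed as a character, and arithmetic on character constants behaves just like arithmetic on any other integers.

#include <stdio.h>

int main() {
  printf("%d\n", '#');      // Prints 35, the ASCII code of '#'.
  printf("%c\n", 35);       // Prints #, the same value shown as a character.
  printf("%c\n", 'a' + 1);  // Prints b: 'a' is just the integer 97.

  return 0;
}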

Safe practices with char variables

To minimize potential issues when working with char variables, avoid comparing them directly with integer values. Instead, compare them with character constants written in single quotes, such as '#' or 'a'; those comparisons are correct regardless of the signedness of char. To check whether a char variable meets certain criteria, rely on portable functions from ctype.h such as isalpha or isdigit, rather than testing whether its value falls within a hard-coded range of integer codes for a character class such as "alphanumeric" or "digit". Moreover, for arithmetic on 8-bit integers, prefer int8_t or uint8_t over the char type whenever possible.
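As an illustration, a small classification routine along these lines could look as follows (the classify helper is just for this example); note that the ctype.h functions expect a value representable as an unsigned char or equal to EOF, so a plain char argument is conventionally cast to unsigned char first:

#include <ctype.h>
#include <stdio.h>

static void classify(char c) {
  // Cast to unsigned char: passing a negative char (possible when char is
  // signed) to the ctype.h functions is not allowed.
  if (isdigit((unsigned char)c)) {
    printf("'%c' is a digit\n", c);
  } else if (isalpha((unsigned char)c)) {
    printf("'%c' is a letter\n", c);
  } else {
    printf("'%c' is something else\n", c);
  }
}

int main() {
  classify('7');
  classify('x');
  classify('#');

  return 0;
}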