A deep dive into the char type in C/C++
From the name of the type, one might think that a char variable always represents a character (an ASCII character, to be precise). However, char in C/C++ is not a "character type" but an integer type. It is an 8-bit integer type on virtually every platform, including all mobile phones and desktop computers; the C standard itself only guarantees that it has at least 8 bits (see CHAR_BIT in limits.h).
The C standard does not define whether char is a signed or unsigned type. Even though the standard defines char, signed char, and unsigned char as three distinct types, it requires char to have the same range, representation, and behavior as either signed char or unsigned char (see section 6.2.5, item 15 of the C11 standard). Some compilers allow you to specify whether the char type should behave like signed char or unsigned char, depending on what is best for your project. For instance, gcc accepts the -fsigned-char and -funsigned-char flags.
This flexible signedness of the char type paves the way for many dangerous pitfalls. For example, the following code will run in finite time if char is unsigned, but will loop indefinitely if char is signed:
#include <stdio.h>

int main() {
    for (char c = 0; c < 200; ++c) {
        printf("%c\n", c);
    }
    return 0;
}
If char is an unsigned type, the program will finish because an unsigned 8-bit integer can store values ranging from 0 to 255. However, a signed 8-bit integer can only hold values between -128 and 127, i.e., the condition c < 200 will always evaluate to true. To be precise, when evaluating the condition c < 200 with char being a signed type, c is first promoted to int and then compared with the (int) value 200 (since int has at least 16 bits, it can represent 200). However, even after c is promoted to int, the resulting value will never exceed 127: every time c reaches 127, the update statement ++c computes 128, which does not fit in a signed 8-bit char. Converting that value back to char is implementation-defined and yields -128 on typical two's complement platforms.
Below is a classic example of how an incorrect assumption about the signedness of the char type can lead to undesired program behavior:
#include <stdio.h>

int main() {
    // DANGER: getchar() returns int, not char!
    char c = getchar();
    while (c != EOF) {
        printf("%c\n", c);
        c = getchar();
    }
    return 0;
}
If char is unsigned, this program will continuously consume input because EOF is a negative integer (typically defined as -1 in stdio.h), a value that an unsigned char cannot represent. Specifically, in the two's complement representation used by modern computers, -1 corresponds to 0xffffffff (assuming a 32-bit int; a similar argument holds for 16-bit systems). When this value is assigned to an unsigned char, it gets truncated to 0xff, so when EOF is encountered, c becomes 0xff. When later compared with EOF, c is promoted to int, and its value is converted to 0x000000ff, which is 255 in decimal. As a result, the condition c != EOF remains true even when getchar returns EOF, indicating an error or end of input. If char is signed, the program is still wrong in a different way: a valid input byte 0xff becomes -1 when stored in c and is mistaken for EOF, so the loop may stop early. The corrected version of the program is provided below:
#include <stdio.h>

int main() {
    // CORRECT: getchar() returns int.
    int i = getchar();
    while (i != EOF) {
        char c = i;
        printf("%c\n", c);
        i = getchar();
    }
    return 0;
}
Note how, in our very first sample code above, we used the pre-increment operator on a variable of type char. More generally, since char is an arithmetic type, all statements in the code below are valid:
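int main() {
    char c;
    c = 35;
    c = '#';   // Same value as above on ASCII systems.
    // Shortcut assignment operators.
    c *= 2;    // 70
    c /= 3;    // 23
    c %= 25;   // 23
    // Increment and decrement operators.
    ++c;       // 24
    --c;       // 23
    // Arithmetic expressions (operands are promoted to int).
    c = 'a' + '#' - 20;     // 97 + 35 - 20 = 112 in ASCII
    c = 2 * 'c' + '#' / 7;  // 198 + 5 = 203, out of range if char is signed
    return 0;
}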
It's vital to remember that since char is an integer type, it can overflow, as demonstrated in our first example. To ensure consistent behavior across different compilers, consider using signed char or unsigned char instead of plain char. For arithmetic operations with 8-bit integers, it's advisable to use the int8_t and uint8_t types instead of the char type. These 8-bit integer types can be found in stdint.h.
As the statements c = 35 and c = '#' in the arithmetic example illustrate, an expression like '#' represents the integer value of the ASCII code associated with that symbol. Given that the code for '#' is 35, the first two assignment statements in that program are equivalent. This is because expressions such as '#' are replaced by their corresponding integer values during compilation. Hence, a char type isn't intrinsically more "character-like" than any other type. It's merely the smallest integer type capable of representing any of the 128 characters from the ASCII standard.
Safe practices with char variables
To minimize potential issues when working with char variables, avoid comparing them directly with integer values. Instead, compare them with ASCII characters enclosed in single quotes, as these comparisons are always valid. To check whether a char variable meets certain criteria, rely on portable functions like isalpha or isdigit (converting the value to unsigned char before the call), rather than verifying that its value lies within a range of integers representing a specific character class, such as "alphabetic" or "digit". Moreover, for arithmetic operations with 8-bit integers, it's advisable to use int8_t or uint8_t over the char type whenever possible.