So what the heck is a char, anyway?


Posted by Diego Assencio on 2015.12.14 under Programming (C/C++)

From the type name, one may be fooled into thinking that a char variable always represents a character, an ASCII character to be precise. However, char in C/C++ is not a "character type" but instead an "integer type". Indeed, the char type is simply an 8-bit integer type on virtually every platform (including all mobile phones and desktop computers); the standard only requires it to have at least 8 bits, and the exact number is given by CHAR_BIT in limits.h.

The C standard does not define whether char is a signed or unsigned type; the choice is implementation-defined. Even though the standard defines char, signed char and unsigned char as three different types, it requires char to have the same range, representation and behavior as either signed char or unsigned char (see section 6.2.5, item 15 of the C11 standard). Some compilers allow you to choose what is best for your project. For instance, gcc can be executed with the -fsigned-char or -funsigned-char flags to have char behave as signed char or unsigned char respectively.

This flexible signedness of the char type paves the way for many dangerous pitfalls. For example, the following code will run in finite time if char is unsigned, but will loop forever if char is signed:

#include <stdio.h>

int main()
{
	for (char c = 0; c < 200; ++c)
	{
		printf("%c\n", c);
	}

	return 0;
}

If char is an unsigned type, the program will finish because an unsigned 8-bit integer can store any value from 0 to 255. However, a signed 8-bit integer can only hold values from -128 to 127, i.e., it can never be larger than or equal to 200, and therefore the condition of the for loop will never evaluate to false. To be precise, when evaluating the condition c < 200 with char being a signed type, c is first promoted to int and then compared with the (int) value 200; since int is at least 16 bits long, it can represent 200, but even after c is promoted to int, its value will never exceed 127. Every time c reaches 127, the update statement ++c computes 128 as an int and converts it back to char; the result of that out-of-range conversion is implementation-defined, and on typical two's complement platforms it wraps around to -128, so the loop starts over.

Below is also a classic example of how an incorrect assumption of the signedness property of the char type can lead to unwanted program behavior:

#include <stdio.h>

int main()
{
	/* WRONG: getchar() returns int, not char! */
	char c = getchar();

	while (c != EOF)
	{
		printf("%c\n", c);
		c = getchar();
	}

	return 0;
}

This program will never stop consuming input if char is unsigned. This happens because EOF is a negative int value, typically -1 (it is defined in stdio.h), which an unsigned char cannot represent. Specifically, in the two's complement representation of integer values used by modern computers, -1 is 0xffffffff (assuming int is 32 bits long, but the same argument holds on systems where int is 16 bits long). When this value is stored in an unsigned char, it is truncated to the one-byte integer value 0xff. This is where things go bad: when c is compared with EOF on line 8 of the code above, c is promoted to int, and its value 0xff becomes 0x000000ff, which is 255 in decimal notation. So the condition c != EOF will always be true, even when getchar returns EOF to indicate an error or the end of input. If char is signed, the loop does terminate, but the program is still wrong: on typical implementations an input byte with value 0xff is converted to -1 and mistaken for EOF, so the program may stop too early. This is how the program above should have been written instead:

#include <stdio.h>

int main()
{
	/* CORRECT: getchar() returns int! */
	int i = getchar();

	while (i != EOF)
	{
		char c = i;
		printf("%c\n", c);
		i = getchar();
	}

	return 0;
}

Notice how, in our very first code sample above, we used the pre-increment operator on a variable of type char. In general, since char is an arithmetic type, all statements in the code below are valid:

int main()
{
	/*
	 * since c is used here explicitly as a numeric type,
	 * it should have been declared as either signed char
	 * or unsigned char (alternatively, int8_t or uint8_t)
	 */
	char c;

	/* assignment statements */
	c = 35;
	c = '#';

	/* shortcut assignments */
	c *= 2;
	c /= 3;
	c %= 25;

	/* increment and decrement operators */
	c++;
	--c;

	/* arithmetic expressions */
	c = 'a' + '#' - 20;
	c = 2*'c' + '#'/7;

	return 0;
}

It is important to keep in mind that, since char is an integer type, it can overflow (as was the case in our first example). To have your code behave the same way regardless of which compiler is used, consider using signed char or unsigned char instead of plain char. Whenever you wish to perform arithmetic operations on 8-bit integers, use the int8_t and uint8_t types instead of a char type. These 8-bit integer types are defined in stdint.h.

As the code just shown illustrates, an expression such as '#' represents the (integer) value of the code assigned to this symbol in the execution character set, which is ASCII or a superset of it on virtually every platform. Since the ASCII code assigned to '#' is 35, the two assignment statements in the program above are equivalent to each other, because expressions like '#' are replaced with their associated integer values when a program is compiled. Therefore, a char type is no more similar to a "character" than any other integer type. It is merely the smallest integer type which can represent any of the 128 characters from the ASCII standard.

Playing safe with char variables

If you wish to avoid trouble when dealing with characters through variables of type char, do not compare them with integer values; instead, compare them only with ASCII characters enclosed in single quotes, since these comparisons are guaranteed to be valid. Additionally, it is always better to use portable functions such as isalpha or isdigit (declared in ctype.h) to check whether a char variable satisfies a given property than to check whether its value falls within a range of integer values which represents a specific set of characters such as "alphabetic" or "digit". When calling these functions, cast the argument to unsigned char: passing a negative value other than EOF is undefined behavior, and a plain char may be negative if it is signed. Finally, whenever the goal is doing integer arithmetic on 8-bit integer types, use int8_t and uint8_t if possible instead of a char type.
