Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rtextures.c] ColorToInt(RED) is UB #3987

Closed
OetkenPurveyorOfCode opened this issue May 17, 2024 · 12 comments
Closed

[rtextures.c] ColorToInt(RED) is UB #3987

OetkenPurveyorOfCode opened this issue May 17, 2024 · 12 comments

Comments

@OetkenPurveyorOfCode
Copy link
Contributor

Minimal example:

#include "raylib.h"

int main(void)
{
    ColorToInt(RED);
    return 0;
}

Produces the following error:

../raylib/src\rtextures.c:4513:33: runtime error: left shift of 230 by 24 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../raylib/src\rtextures.c:4513:33 in

The issue is that any color with a high enough RED value causes the signed integer to overflow which is undefined as per the C standard.

This happens here:

int ColorToInt(Color color)
{
    int result = (((int)color.r << 24) | ((int)color.g << 16) | ((int)color.b << 8) | (int)color.a);

    return result;
}

I tired rewriting it either to:

int ColorToInt(Color color)
{
    unsigned int unsinged_result = (((unsigned int)color.r << 24) | ((unsigned int)color.g << 16) | ((unsigned int)color.b << 8) | (unsigned int)color.a);
    int result;
    memcpy(&result, &unsigned_result, 4);
    return result;
}

which assumes that int is 4 bytes and stored as two's complement. (That would be UB on another machine but that does not affect me, the signed overflow affects me though)

Or this ugly dance around (https://stackoverflow.com/questions/5129498/how-to-cast-or-convert-an-unsigned-int-to-int-in-c):

        int result;
        if (unsigned_result < INT_MAX) {
            result = unsigned_result;
        }
        else {
            result = -(unsigned int)(~unsigend_result) -1;
        }

(which also assume 2's complement which is only guaranteed in the latest C standard (but I am compiling with -std=c2x, so that would not be an issue for bleeding edge people like me)

Both run fine without triggering UBSAN for the entire range of uints (though retyping them in this issue may have introduced errors)

Or should the ColorToInt signature to be changed to return an unsigned int (probably a breaking change)?

@JayLCypher
Copy link
Contributor

Potential fix?

int ColorToInt(Color color)
{
    return (int)(((unsigned)color.r << 24) | ((unsigned)color.g << 16) | ((unsigned)color.b << 8) | color.a);
}

The core problem stems from explicit implicit upcast from unsigned char goes to int for operator << I believe (from my little testing).
Casting to unsigned makes enough room for shift-left 24, which can then be sign-casted explicitly to integer.
Doesn't trigger UB san from my godbolt.

Neat issue though!

@orcmid
Copy link
Contributor

orcmid commented May 20, 2024

@JayLCypher

Potential fix?

int ColorToInt(Color color)
{
    return (int)(((unsigned)color.r << 24) | ((unsigned)color.g << 16) | ((unsigned)color.b << 8) | color.a);
}

Neat issue though!

Yes.

And it only now occurs to me that this assumes int is at least 32 bits, as opposed to the at-least 16 bits of the ISO C language specification.

@OetkenPurveyorOfCode
Copy link
Contributor Author

OetkenPurveyorOfCode commented May 20, 2024

int ColorToInt(Color color)
{
    return (int)(((unsigned)color.r << 24) | ((unsigned)color.g << 16) | ((unsigned)color.b << 8) | color.a);
}

I am probably the wrong person to ask on such matters as UB.
I looked it up in the final draft of the C99 standard. Of importance is particularily the third paragraph.

6.3.1.3 Signed and unsigned integers
1 When a value with integer type is converted to another integer type other than _Bool, if
the value can be represented by the new type, it is unchanged.
2 Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or
subtracting one more than the maximum value that can be represented in the new type
until the value is in the range of the new type.49)
3 Otherwise, the new type is signed and the value cannot be represented in it; either the
result is implementation-defined or an implementation-defined signal is raised.

(https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf)
(it is the same in C23)

So, regarding the standard the behaviour is implementation-defined. So I looked up what clang does (did not find it) but here is gcc:

The result of, or the signal raised by, converting an integer to a signed integer type when the value cannot be represented in an object of that type (C90 6.2.1.2, C99 and C11 6.3.1.3).
For conversion to a type of width N, the value is reduced modulo 2^N to be within range of the type; no signal is raised.

(https://gcc.gnu.org/onlinedocs/gcc-5.3.0/gcc/Integers-implementation.html#Integers-implementation)

I am guessing clang does the same as gcc. And clang's UBSAN seems to be happy when I run it as well. It is probably the best solution, you know I opened an issue instead of a PR because I dislike my memcpy/branchy concoction. So yes, I am in favour of @JayLCypher's implementation.

@Peter0x44
Copy link
Contributor

This also could be not quite correct for big endian systems (the return type, anyway). Not sure there are any good reasons to call this function.

@orcmid
Copy link
Contributor

orcmid commented May 20, 2024

This also could be not quite correct for big endian systems (the return type, anyway). Not sure there are any good reasons to call this function.

It is interesting that << is defined in terms of multiplication, so endian-ness is not a factor (although signed-ness is). Endien-ness might matter in whatever someone intends to do with the result though. That prospect has my head hurt.

@OetkenPurveyorOfCode
Copy link
Contributor Author

This also could be not quite correct for big endian systems (the return type, anyway). Not sure there are any good reasons to call this function.

My initital use case was to compare colors, which could be done differently (i. e. by comparing all the fields individually, or maybe with memcmp). in my implementation, I could have also used my own enum of Colors and used that before converting them with a switch statement to the Raylib color shortly before rendering. So in my case, you may be right, but I have no idea what are people are using it for. Regarding endian-ness, the documentation is not exactly clear in which order the fields go, it just states it would convert it to a "hexadecimal" value with GetColor doing the reverse. Furthermore, the implementation is endian-neutral, like it should be, it would produce the same value on a big endian vs. a little endian system.

@JayLCypher
Copy link
Contributor

I'm a pedantic fellow so I enjoy things written "flawlessly", which is what drew me to this issue.

In terms of endianness and bit-width, the structure for color requires 4 * sizeof (unsigned char) to be represented as a single value. I noticed in my tests that endian order mattered memory layout wise using a union, which is sub-optimal.

As for a use case of comparing colors, you could definitively roll a simple inline helper function. Something like:

bool IsSameColor(register const Color a, register const Color b)
{
    return (a.r == b.r) & (a.g == b.g) & (a.b == b.b) & (a.a == b.a);
}

(You can use int or bool return type, depending on your support. If you know the scope you can also make it static inline.
You could also make this a comparison function returning which color is "larger" or "smaller", if that's any use :⁾ )

As for the library function, it is a common issue I see of using non-fixed width integers where widths are needed (such as this case) and using integers as a fundamental type when you don't need signedness. That is old-school type C for ya. :)

The main issue is really that the upcast of an unsigned char ends up being integer when doing operator << on it, which is the bonkers thing to me. So that's why it's each color variable that is explicitly upcasted to 32-bit width unsigned before <<. Once the final value is set, it can be casted to int for return type which just changes the representation (it's now a negative number on LE). You can do this implicitly by binding the variables to a new (32-bit unsigned) variable too.

Cliteral version because I like to write C :)

int ColorToInt(register const Color c)
{
    return (int)((c.a) | (CLITERAL(unsigned){c.b} << 8) | (CLITERAL(unsigned){c.g} << 16) | (CLITERAL(unsigned){c.r} << 24));
}

@OetkenPurveyorOfCode
Copy link
Contributor Author

OetkenPurveyorOfCode commented May 21, 2024

@JayLCypher

In terms of endianness and bit-width, the structure for color requires 4 * sizeof (unsigned char) to be represented as a single value. I noticed in my tests that endian order mattered memory layout wise using a union, which is sub-optimal.

what I meant, is that code like this should work regardless of endianness:

assert(7991807 == ColorToInt(BLUE));
int red = 0xFF'00'00'00;
assert(ColorToInt((Color){.r=255}) == red);

Does it not?

I used ColorToInt because it was easier than writing three lines of a dedicated function.

bool is_same_color(Color a, Color b) {
    return a.r == b.r && a.g == b.g && a.b == b.b && a.a == b.a;
}

bool IsSameColor(register const Color a, register const Color b)
{
return (a.r == b.r) & (a.g == b.g) & (a.b == b.b) & (a.a == b.a);
}

I find your C code really interesting, it really showcases differences in style and I appreciate that
-> You use the keyword register, I on the other hand rely heavily on the compiler's register allocator. I do not even use the register keyword, in my mind, it is the second most useless keyword of C (along with auto, though I think C23 might change that for me, at least in macros)
-> I do not see much value in marking function arguments as const when they are passed by value, the function operates on a local copy anyway, so I do not think marking them const helps prevent bugs.
-> You use the bitwise and operator. I like to make the distinction between logical and and bitwise and, so I use bitwise and for checking flags, and unsetting flags as well as bit masking and use logical and to and together true/false results. I think there is a difference in that && is allowed to short circuit, but that is of no importance here. I guess you save on typing, but if you tried to do that you would not have typed all those parentheses.
-> And of course, we put the braces in different places, who doesn't?

Anyway I went with your original ColorToInt function, I see no need to initialize new unsigned integers using compound literals.

@OetkenPurveyorOfCode
Copy link
Contributor Author

#3996

@Peter0x44
Copy link
Contributor

You use the keyword register, I on the other hand rely heavily on the compiler's register allocator. I do not even use the register keyword, in my mind, it is the second most useless keyword of C (along with auto, though I think C23 might change that for me, at least in macros)

Agreed. There is no reason to ever write "register". I don't think it helps compilers today at all

I do not see much value in marking function arguments as const when they are passed by value, the function operates on a local copy anyway, so I do not think marking them const helps prevent bugs.

const in general has extremely questionable value. I don't use it.

what I meant, is that code like this should work regardless of endianness:

I decided to try, and it seems like it does:
https://gcc.godbolt.org/z/EGeMPzsvY
I used C++ constexpr functions since compiler explorer doesn't support executing code for any big endian system I could find, and GCC optimized main to return 1 for both little and big endian.
I'm not sure why I got confused here, my understanding is the color has to be stored as effectively 0xRRGGBBAA, while a little endian system would store the integer 0xRRGGBBAA as 0xAABBGGRR "in memory" (ignore that G and R obviously aren't valid hex...) and it could then compare wrong. That was my thinking, but it's clearly not the case.

@JayLCypher
Copy link
Contributor

@OetkenPurveyorOfCode
Like I said, I enjoy being pedantic and look for useful stuff. Does much of it matter? Not necessarily.

  • Found out about the register keyword recently so I am trying to reason about it. Many say it's useless because of the register functionality and compilers probably do a better job, but it also restricts another programmer from taking the address of the variable, so it can communicate intent. Kind of the same with Restrict keyword, which I like mostly because it says "Don't alias memory regions or I can explode."
  • I prefer const correctness, and yea all arguments are copy, everything is copy. But argument parameter const says something about function intent, and so does const variables in-scope. I find some part of programming is communicating to another person, whether it's you tomorrow or an actual different person.
  • I probably made a mistake using bit-wise there, as you mentioned with short-circuit. :) However, bit-wise condition checking with addition can be used as indexing into an array as a technique which I find useful and quite beautiful sometimes.
  • I probably put the braces like you for code I write, but Raylib uses this type of braces and (I'm going to attribute to Tsoding) had a good line about "you adopt to the style of the code base, not the code base to your style", which is really good advice I think.

Thanks for the compliment though. :)
Nice issue and conversation to all!

@OetkenPurveyorOfCode
Copy link
Contributor Author

OetkenPurveyorOfCode commented May 21, 2024

restricts another programmer from taking the address of the variable

Yes. But why would you restrict another programmer from taking its address? I struggle to think of a reason here or a scenario where this adds value.

I probably put the braces like you for code I write, but Raylib uses this type of braces and (I'm going to attribute to Tsoding) had a good line about "you adopt to the style of the code base, not the code base to your style", which is really good advice I think

the color_is_same function would be in my codebase, so my style. If I were to add it to Raylib, I would obviously call it ColorIsSame and use raylib's style.

I probably made a mistake using bit-wise there, as you mentioned with short-circuit.

Logical and short-circuits, yours does not. But I said, that this is not relevant here. It was to "communicate intent" as you would say.

Kind of the same with Restrict keyword, which I like mostly because it says "Don't alias memory regions or I can explode."

I would use restrict only to hint to the optimiser, it can enable simd optimisations and reduce rendundant loads, but I have not used it so far (I would have to have a suitable performance problem in the first place).

Anyway, I haven't been programming in C for long, so don't take my advise too seriously. Have fun!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants