C Strings and my gradual descent to insanity

2023-04-06 07:22:10

Credit to the reader who pointed out that “有り難う” means “thanks” rather than “hello”. I got my examples mixed up while I was coding and missed this. Thanks for the correction!

I’ve been on a C kick lately as I learn the intricacies involved in low-level programming. As a Data Scientist/Python programmer I work with strings all the time. People say that handling strings in C ranges anywhere from difficult to downright awful. I was curious, so I decided to see how deep the rabbit hole went.

A C string is an array of characters that ends with a null terminator, '\0'. When C manipulates strings, the null terminator is what tells functions that the end of the string has been reached. In C we can declare a string in two different ways. The first and most difficult way is as a literal character array.

#include <stdio.h>

int main() {
    char myString[] = {'H', 'e', 'l', 'l', 'o', ',', ' ', 'W', 'o', 'r', 'l', 'd', '!', '\n', '\0'};
    printf("%s", myString);
    return 0;
}

This is error prone and requires you to insert the null terminator yourself. It also takes forever to write for long phrases, 0/10. The second way is as a string enclosed in double quotes.

#include <stdio.h>

int main() {
    char myString[] = "Hello, World!\n";
    printf("%s", myString);
    return 0;
}

In this scenario C knows exactly how long the string is and can automatically insert the null terminator.

Once you have properly encoded a string, there are many operations you can perform. Common functions on strings include strcpy, strlen, and strcmp. strcpy copies the string stored in one variable to another. strlen gets the length of a string (minus the null terminator), and strcmp takes two strings and returns 0 when they’re equal. Unfortunately, there are a lot of nuances when working with string functions. First, let’s look at an example of each, starting with strcpy.

#include <stdio.h>
#include <string.h>

int main() {
  char source[] = "Hello, world!";
  char destination[20];

  strcpy(destination, source); // Copy the source string to the destination string

  printf("Source: %s\n", source);
  printf("Destination: %s\n", destination);

  return 0;
}

Here is the output of our code from above.

Source: Hello, world!
Destination: Hello, world!

As you might have expected, strcpy works by copying a string and putting its contents in another string. But you might be asking, “Why can’t I just assign the source variable directly to the destination variable?”

#include <stdio.h>

int main() {
  char source[] = "Hello, world!";
  char* destination = source; // destination now points at source's characters; nothing is copied

  printf("Source: %s\n", source);
  printf("Destination: %s\n", destination);

  return 0;
}

You can. It’s just that destination now becomes a char* and exists as a pointer to the source character array. If that isn’t what you want, then it will almost certainly cause issues.

Our next string operation is strlen, which gets the size of the string minus the null terminator.

#include <stdio.h>
#include <string.h>

int main() {
  char str[] = "Hello, world!"; // The string to find the length of

  size_t length = strlen(str); // Find the length of the string

  printf("The length of the string '%s' is %zu.\n", str, length);

  return 0;
}

The output of our print function for strlen looks like this.

The length of the string 'Hello, world!' is 13.

This is pretty simple. It just counts the characters until it hits the null terminator.

Our last function is strcmp. It looks at two strings and determines whether they’re equal to each other or not. If they are, it returns 0. If they aren’t, it returns a nonzero value: negative if the first string sorts before the second, positive if it sorts after.

#include <stdio.h>
#include <string.h>

int main() {
  char str1[] = "Hello, world!";
  char str2[] = "hello, world!";

  int result = strcmp(str1, str2); // Compare the two strings

  if (result == 0) {
    printf("The strings are equal.\n");
  } else {
    printf("The strings are not equal.\n");
  }

  return 0;
}

The output of our strcmp function…

The strings are not equal.

Now that we know how to copy, get the length of, and compare our strings, I’ll drop the bombshell. None of these are safe operations, and it’s easy to create undefined behavior. This mostly revolves around the use of '\0' as the null terminator. In the case of the above C functions as well as others, C expects to find a '\0', which tells the function to stop reading the area in memory where the string lives. But what if there is no null terminator? Well, C will happily keep reading the contents in memory after the string should have ended. If our program’s function was supposed to verify the user’s supplied password, a bad actor could potentially skip the area in memory where the password check happens and go right into the function that is called for password successes, by exploiting a buffer overflow with the string. This would circumvent the entire authorization process. So how do we handle this?

Gif of someone pointing, putting on the helmet and saying safety first on a bicycle

You sweet summer child

If you do some poking around, you might find a function called strncpy. When you look at its definition, you’ll see that it copies a source string into a destination string and allows you to specify the number of bytes to copy. “This looks good!” you might say. I can make sure that my destination string only receives as many bytes as it can handle. Here is some code below that illustrates that, as well as its output.

#include <stdio.h>
#include <string.h>

#define dest_size 12

int main() {
    char source[] = "Hello, World!";
    char dest[dest_size];

    // Copy at most 12 characters from source to dest
    strncpy(dest, source, dest_size);

    printf("Source string: %s\n", source);
    printf("Destination string: %s\n", dest);

    return 0;
}

Source string: Hello, World!
Destination string: Hello, World

At first this seems great, but there’s a problem. What happens when the source string minus the null terminator is at least as long as the size of the destination string? The answer is that the destination gets filled with characters from the source string with no room left for the null terminator. That is what happened in the code above: printing dest there is already undefined behavior, and the clean output is luck. A non-null-terminated string will most likely cause you headaches later on. “Okay,” you might say, “but at least it can handle the case where the source string is smaller than the destination string?” And yes, it can handle that case, but so can strcpy. And if the source string is smaller than the destination string, all the extra space in the destination string that isn’t used is still reserved and padded with null bytes. So, if the destination string is 20 characters long but the source string is only 13, you get a destination string that effectively looks like this.

char destination[20] = {'H', 'e', 'l', 'l', 'o', ',', ' ', 'W', 'o', 'r', 'l', 'd', '!', '\0', '\0', '\0', '\0', '\0', '\0', '\0'};

So no proper null termination and excessive padding. That isn’t great. If you happen to be on Windows and use the strncpy function, the Microsoft Visual C++ (MSVC) compiler won’t even compile the program. You have to manually set a flag to allow using deprecated features, which is a hint you should probably not be using it.

It suggests using strncpy_s instead. Let’s look at that now. strncpy_s accepts these parameters…

  • char *restrict dest: The destination string

  • rsize_t destsz: The size of the destination string

  • const char *restrict src: The source string to be copied

  • rsize_t count: The maximum number of bytes to copy from the source string

If the destination string is longer than the source string then everything gets copied fine. But if the destination is smaller than the source, then at most the size of the destination - 1 is copied over (with MSVC this truncation is what you get when you pass _TRUNCATE as the count; without it, a copy that doesn’t fit is reported as an error instead). The additional check that strncpy_s makes is that it ensures that even when a source string is copied into a destination string, the resulting string is always null terminated. This is great, but again we have two problems.

  1. strncpy_s doesn’t deal with extreme padding

  2. strncpy_s isn’t portable to macOS or Linux

Now if at this point you’re shaking your fist at the sky cursing the C standards committee for dragging their feet on implementing portable safer string operations for 34 years, I don’t blame you.

So how can we handle this case safely? There are a few ways I can think of.

  1. If you are working with strings of a known length like in our contrived example, it could be as simple as initializing our destination string to the sizeof() the source string.

  2. You could just work with a pointer to the source string and forgo copying completely. As long as the source string is properly terminated you won’t have to deal with mismatched buffer sizes.

  3. You could forgo portability and use the _s versions of string functions on Windows, or the “l” versions of them (such as strlcpy) on macOS.

  4. You could use a different language

By this point you may have noticed that I’ve spent a lot of time talking about strcpy but have only briefly mentioned strcmp and strlen. All of them suffer from the same issue that stems from the way C strings are terminated. Because the length of a string is unknown until you hit a null terminator, you get all sorts of undefined behavior and attack vectors. This is in contrast to C++, which treats strings as objects and stores the length of the string along with its characters. This is one of the reasons why people tend to write C in C++. Use all the bits you consider “good” and ignore everything else.

To handle these properly in pure C requires carefully implementing checks around string operations. This is error prone, and harder as the program gets larger. This is one of the reasons why C is considered an unsafe language.

Unicode was an important step in text encoding for computers. Today, UTF-8 is the dominant encoding for text. I briefly summarize its history in this article, Breaking the Snake: How Python went from 2 to 3, so I won’t belabor the point here. C didn’t gain Unicode support until C99, and even if you handle it properly in C you can run into issues in other ways, as you will see in a moment. If we try to print out some Japanese characters…

#include <stdio.h>
#include <string.h>

int main() {
    printf("有り難う\n");
    return 0;
}

The output isn’t what we expect.

This is because we aren’t interpreting the characters as Unicode characters. Let’s rewrite the code to fix that.

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
  setlocale(LC_ALL, ""); // Set the locale to the user's default locale
  wchar_t hello[] = L"有り難う";
  wprintf(L"Hello in Japanese is: %ls\n", hello);
  return 0;
}

I added the string “Hello in Japanese is:” for a good reason. If you look at the screenshot below you can see why. The output still isn’t shown.

Checking the encoding of the PowerShell console we see that it’s in ASCII. Okay, let’s change the encoding with $OutputEncoding = [System.Text.Encoding]::UTF8. Now it’s in UTF-8. It still doesn’t work. Maybe it’s because the font doesn’t support Japanese. Some quick Googling later I see that the MS Gothic font does, and I switch my font to that.

Screenshot of PowerShell in MS gothic

output of dir in PowerShell

My “\” are now “¥” but I can live with that if this works. I named a test folder 有り難う to make sure that PowerShell displays it correctly. If we look at the test folder now, we see kanji are now being displayed correctly. But even with that change the code still doesn’t print the characters! I try setting the locale to ja_JP.UTF8, but still can’t get an output. Some more googling and I come across an article titled PowerShell console characters are garbled for Chinese, Japanese, and Korean languages on Windows Server 2022

which states…


By default, Windows PowerShell .lnk shortcut is hardcoded to use the “Consolas” font. The “Consolas” font doesn’t have the glyphs for CJK characters, so the characters aren’t rendered correctly. Changing the font to “MS Gothic” explicitly fixes the issue because the “MS Gothic” font has glyphs for CJK characters.

The Command Prompt (cmd.exe) doesn’t have this issue, because the cmd .lnk shortcut doesn’t specify a font. The console chooses the right font at runtime depending on the system language.

Resolution

The issue will be fixed in Windows 11 and Windows Server 2022 very soon, but the fix won’t be backported to lower versions.

To work around the issue, use either of the following two workarounds.

Okay cool, this doesn’t look like my exact issue, but it seems like PowerShell doesn’t handle Japanese characters well by default. I try using the Command Prompt with MS Gothic but that doesn’t solve it either. Everything that I’ve Googled shows that this should work in C. I switch the code back to…

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
  setlocale(LC_ALL, ""); // Set the locale to the user's default locale
  wchar_t hello[] = L"有り難う";
  wprintf(L"Hello in Japanese is: %ls\n", hello);
  return 0;
}

and I run it on my Raspberry Pi and… It works!

I try it on my MacBook Pro and it works as well. I fire up PowerShell on my MacBook Pro and… It still works… So thankfully this isn’t a bug in C, but it does appear to be an issue with the way Windows handles non-Latin characters in its terminals. Alls I can say is… Go easy on Microsoft. They’re a small indie dev…. But for real, if anyone knows how to get this to work on Windows 10 let me know!

Now that we can print Japanese characters correctly in C, let’s look at one final case before this article gets too long. As we saw earlier, getting the length of a string can be done with strlen. If we modify our naive C code from the beginning to get the strlen of the Japanese string it looks like this…

#include <stdio.h>
#include <string.h>

int main() {
  printf("The length of the string is %zu characters\n", strlen("有り難う"));
  return 0;
}

and the output is…

The size of the string is 12 characters

If we go back to our original print output, we can see this string of characters that’s printed out incorrectly.

You’ll notice there are 12 characters. The reason for this is we’re interpreting the string as ASCII. Since kanji and kana take more than one byte to encode (three bytes each in UTF-8), each byte in this four-character phrase is being interpreted as an individual letter, as opposed to each cluster of bytes being treated as one character. If we change the string to a wchar_t instead of a char by prefixing it with an “L”, and use wcslen instead of strlen, we get the code below…

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

int main() {
  printf("The length of the string is %zu characters\n", wcslen(L"有り難う"));
  return 0;
}

which prints out…

The size of the string is 4 characters

Joy.

This article only scratches the surface when it comes to working with C strings. We didn’t even have time to touch on the Unicode literals introduced in C11 like ‘u8’, ‘u’, and ‘U’. Needless to say, when working with C strings you have to be careful. At a minimum you can cause a headache for yourself by creating undefined behavior. At worst, you can potentially create an attack vector for someone to exploit. If you’ve only worked in garbage-collected programming languages, you might be wondering why you would go through the trouble of working in C. If you look at a language like Python, and the libraries that it uses in the data science field, most of them are built on top of C and C++. Someone’s got to do it, and if you have the knowledge, almost all languages have a C foreign function interface that can be leveraged to speed up code, so the benefits will usually carry over to other languages as well. So, learn you some C, but maybe don’t start with strings first.

If you made it this far, thanks for reading! If you are new, welcome! I like to talk about technology, niche programming languages, AI, and low-level coding. I’ve recently started a Twitter and would love for you to check it out. If you liked the article, consider liking and subscribing. And if you haven’t, why not check out another article of mine! Thank you for your valuable time.
