From Binary to Character Representations

At the core of every computer system in the world is a CPU, which accepts and acts upon streams of 0’s and 1’s. A “0” is considered off and a “1” is considered on. But why do computers break everything down to these two values? Further, how does binary translate from English-speaking countries to non-English-speaking countries, or vice versa?

Without going into a history lesson or a scientific explanation, both of which can lead you down several rabbit holes, let’s discuss this a bit further and hopefully answer those questions. My goal is to give at least a high-level overview of how computers work and how we’re able to communicate the way we do.

(B)inary Dig(its) – bits

For starters, a computer system’s circuitry is powered by, well, electricity. This flow of electricity can be controlled: throttling the flow registers as “off” and allowing it registers as “on”. The current flows through transistors, which can be set to “on” or “off”, 1 or 0. In its simplest form, this is how a “binary digit”, or bit for short, is determined. If electricity is flowing freely through the transistor, a “1” is registered; otherwise the transistor is considered off and assigned a “0”. There are many more complexities in how transistors are used and organized, but that’s beyond the scope of this post.

So now that we have a bunch of 0’s and 1’s, what do we do with them?

Base 2 vs Base 8 vs Base 10 vs Base 16

Let’s compare a few numbering systems with the one we use every day, base 10. Base 10 refers to the fact that there are 10 different digits, 0 through 9. Further, each digit is valued by its position in the number: the ones column, tens column, hundreds column, thousands column, etc. If there is a “9” in the ones column, adding more to it causes a “carry over” to the next column; 9 + 2 = 11, so we went from a number solely in the ones column to one that carries over into the tens column.

Let’s take a random number as an example, say “1024”. We evaluate the full number based on the placement of each digit: “4” is in the ones column, so 4 * 1 = 4; “2” in the tens column gives 2 * 10 = 20; “0” in the hundreds column gives 0 * 100 = 0; and “1” in the thousands column gives 1 * 1000 = 1000. If we add up each product (4 + 20 + 0 + 1000), we come up with the full number, 1024. This method is likely so easy and convenient for humans because we have 10 fingers. Let’s compare this method of counting with the following formats: binary, octal, and hexadecimal.
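To make the column idea concrete, here is a minimal Python sketch that splits a number into its digits and column weights. The function name `place_values` is my own, purely for illustration, and it assumes a non-negative whole number.

```python
def place_values(number: int, base: int = 10) -> list[tuple[int, int]]:
    """Return (digit, column_weight) pairs, starting from the ones column.

    Assumes `number` is a non-negative integer and `base` is at least 2.
    """
    pairs = []
    weight = 1  # the rightmost column is always worth 1
    while True:
        number, digit = divmod(number, base)  # peel off the rightmost digit
        pairs.append((digit, weight))
        weight *= base  # each column is worth `base` times the previous one
        if number == 0:
            break
    return pairs


pairs = place_values(1024)
print(pairs)  # [(4, 1), (2, 10), (0, 100), (1, 1000)]
print(sum(digit * weight for digit, weight in pairs))  # 1024
```

Passing a different `base` reuses the same logic for the other numbering systems discussed in this post.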

The binary format is also known as base 2, because there are only 2 values that can be used: “0” or “1”. The main difference from base 10 is that each column is worth double the previous column. So where base 10 goes ones, tens, hundreds, etc., binary goes ones, twos, fours, eights, sixteens, etc.
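As a rough sketch of those doubling columns, the snippet below adds up the weight of every column that holds a “1”. The helper name `binary_to_decimal` is hypothetical; Python’s built-in `int(bits, 2)` does the same job and is used here only as a cross-check.

```python
def binary_to_decimal(bits: str) -> int:
    """Sum the column weights (1, 2, 4, 8, ...) wherever the bit is 1."""
    total = 0
    weight = 1  # ones column
    for bit in reversed(bits):  # walk from the rightmost column leftwards
        if bit not in ("0", "1"):
            raise ValueError(f"not a binary digit: {bit!r}")
        total += int(bit) * weight
        weight *= 2  # the next column is worth double
    return total


print(binary_to_decimal("1010"))  # 10  (8 + 2)
print(binary_to_decimal("1111"))  # 15  (8 + 4 + 2 + 1)
print(int("1010", 2))             # 10, the built-in agrees
```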

As a brief example, let’s do some (simple) binary math: 0101 + 0111 = 01100. Adding two 1’s together causes the same kind of “carry over”, since a column can’t hold anything other than 0 or 1. Start at the ones column: 1 + 1 = 2 in decimal, but in binary 1 + 1 = 10, because we write a 0 and carry the extra one into the twos column. Looking at just the two rightmost columns of the example, 01 + 11 = 100: the ones column produces a carry, which is added to the 0 + 1 already in the twos column. So the twos column is 1 (carried digit) + 0 + 1. Since we get “2” again, we write a 0 and carry the extra digit into the fours column. The fours column then holds 1 (carried digit) + 1 + 1 = 3, or 11 in binary, so we write a 1 and carry once more into the eights column, giving 1100 (decimal 12, which matches 5 + 7).
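The column-by-column carrying described above can be sketched in a few lines of Python. The name `add_binary` is made up for this example, and the function only handles strings of 0’s and 1’s.

```python
def add_binary(a: str, b: str) -> str:
    """Add two binary strings column by column, carrying like long addition."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)  # pad the shorter one with leading zeros
    carry = 0
    result = []
    for bit_a, bit_b in zip(reversed(a), reversed(b)):  # ones column first
        total = int(bit_a) + int(bit_b) + carry
        result.append(str(total % 2))  # the digit that stays in this column
        carry = total // 2             # 1 if the column overflowed, else 0
    if carry:
        result.append("1")  # a final carry opens a new column on the left
    return "".join(reversed(result))


print(add_binary("0101", "0111"))  # 1100
print(add_binary("01", "11"))      # 100
```

The result comes back as 1100 rather than 01100; the leading zero in the article’s version is just padding and doesn’t change the value.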

If you thought this was fun, then let’s add to that. Binary streams can get rather long and hard to read. Considering that applications and such are constant streams, imagine a Matrix-like stream passing in front of you. That’d be tough to read with convenience. Fortunately, engineers came up with more compact ways to represent longer streams of binary data: base 8, or octal, and base 16, or hexadecimal. Base 8 uses 8 different digits, 0 through 7. Once you hit 7, the column starts over at “0” again and you carry into the next column. Base 16 uses 0 through 9 and then the letters “A” through “F” for the values 10 through 15. So decimal 10 is written as “A” in hexadecimal, 11 is “B”, and so on up to 15, which is “F”; decimal 16 is then written as “10”.
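Python’s built-in `format` and `int` functions can show the same value in each of these bases. The sketch below is only an illustration; the wrapper `to_bases` is a name I made up.

```python
def to_bases(value: int) -> dict[str, str]:
    """Render a non-negative integer in binary, octal, decimal, and hexadecimal."""
    return {
        "binary": format(value, "b"),
        "octal": format(value, "o"),
        "decimal": str(value),
        "hex": format(value, "X"),
    }


print(to_bases(255))
# {'binary': '11111111', 'octal': '377', 'decimal': '255', 'hex': 'FF'}

# Decimal 10 through 15 map to A through F; 16 rolls over to "10".
for n in range(10, 17):
    print(n, format(n, "X"))

# Going the other way: parse a string in a given base.
print(int("FF", 16), int("377", 8))  # 255 255
```

Eight binary digits shrink to three octal digits or two hexadecimal digits, which is why the longer bases are easier on the eyes.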

Character Sets

In order to translate the binary data back into readable text, there needs to be some standardized format for distinguishing values. Think of this as a legend on a map describing what the symbols mean. For example, computers in English-speaking countries have traditionally used the American Standard Code for Information Interchange (ASCII) character set. It defines 128 codes covering uppercase and lowercase letters (A, B, C, a, b, c, etc.), digits (0, 1, 2, etc.), special characters (!, @, #, etc.), and a set of non-printing control characters. See this link for more details about the ASCII character set as well as the hexadecimal reference.
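To see the “legend” in action, the following sketch looks up the ASCII code for each character using Python’s built-in `ord` and `chr`. The helper `ascii_rows` is a hypothetical name, and it rejects anything outside the 128 ASCII codes.

```python
def ascii_rows(text: str) -> list[tuple[str, int, str, str]]:
    """Return (character, decimal code, 7-bit binary, hex) for each character."""
    rows = []
    for ch in text:
        code = ord(ch)  # the number the character maps to
        if code > 127:
            raise ValueError(f"{ch!r} is outside the 128-code ASCII range")
        rows.append((ch, code, format(code, "07b"), format(code, "02X")))
    return rows


for row in ascii_rows("Hi!"):
    print(row)
# ('H', 72, '1001000', '48')
# ('i', 105, '1101001', '69')
# ('!', 33, '0100001', '21')

print(chr(65))  # A  (the lookup also works in reverse)
```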

Now, non-English-speaking countries naturally wouldn’t use the same character set, since they don’t use the same letters we do. In addition, some characters require more bits for their representation than our standard alphabet does, which only adds to the complexity. Fortunately, the standardizing bodies worked on a better format: Unicode.

Unicode was created as a single character set intended to include every writing system. Until then, however, many countries and governing bodies made their own decisions, and the result was rather chaotic. If you’re curious for more examples, refer to this link.
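As a small illustration of characters needing more bits, the sketch below prints each character’s Unicode code point and how many bytes it takes when stored as UTF-8. UTF-8 is my assumption here, since this post doesn’t name a specific encoding; it is one of several ways Unicode code points get written out as bytes. The helper `describe` is a made-up name.

```python
def describe(ch: str) -> tuple[str, int]:
    """Return the code point (as U+XXXX) and the UTF-8 byte count of one character."""
    return (f"U+{ord(ch):04X}", len(ch.encode("utf-8")))


# Escapes are used so the example does not depend on the file's own encoding:
# a Latin letter, an accented e, the euro sign, and a winking-face emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F609"]
for ch in samples:
    print(describe(ch))
# ('U+0041', 1)
# ('U+00E9', 2)
# ('U+20AC', 3)
# ('U+1F609', 4)
```

The plain letter “A” still fits in a single byte with the same value it has in ASCII, while the other characters spill over into two, three, or four bytes.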

Conclusion

With the desire to include languages worldwide, a character set that is aware of these differences was needed. Unicode is the attempt to map each character from the various languages to a corresponding place on a computer in binary form. The references listed below are good ones, and I highly encourage you to read them if this high-level overview doesn’t make sense.

References

Beal, V. (2009, September 13). Characters and ASCII Equivalents. Retrieved from https://www.webopedia.com/quick_ref/asciicode.asp

Spolsky, J. (2003, October 8). The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Retrieved from https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Wienand, I. (n.d.). Computer Science from the Bottom Up. Retrieved from https://www.bottomupcs.com/index.xhtml

Wikipedia contributors. (2018, September 24). Character encoding. In Wikipedia, The Free Encyclopedia. Retrieved 22:16, September 28, 2018, from https://en.wikipedia.org/w/index.php?title=Character_encoding&oldid=860955850

Comments

Tom Harger says
October 13, 2018 at 11:59 am
Except for the Unicode bit, I learned this stuff when I was in college, back in the ’70s. My first job had me programming a boot-up sequence in binary on the front panel of the computer. Needless to say, this wasn’t a very useful article for me. 😉
- Emil Hozan says
  October 25, 2018 at 2:31 pm
  Thank you for your comment.
  Your past experience is quite interesting, I can only imagine the fun you had in doing that project!
  Is there anything that should have been added on this topic? Or is there a topic that you’d like for us to write about?
