If you’re just joining us, this is the third article in a series of introductory articles on the command line. We are continuing from the article “Command Line TTY Devices”.
Before we get to the shell, we’re going to have a quick look at the terminal. The terminal itself is nothing more than a user interface. On its own it doesn’t really do much of anything. A terminal always sets up a TTY device and usually starts a shell, although other software can be started instead. Terminals also come in various forms, such as the terminal multiplexers tmux and screen. For now we’re just going to look at what makes up a terminal interface.
Binary and Text Data
To understand the terminal interface, we’re going to take a quick look at how data is represented by computers internally. Everyone knows computers work in 1’s and 0’s, also known as binary [1]. Let’s briefly review how this is done.
When we count, we do so in a numbering system called base 10. Base 10 means we have 10 digits in our numbering system, 0 to 9. When we get past 9 we run out of digits, so another digit is added to write the next number, 10. Binary, also known as base 2, works the same way. There are 2 digits, 0 and 1, and after 1 we add digits just as we do in base 10.
Base 10 | Base 2 |
---|---|
0 | 0 |
1 | 1 |
2 | 10 |
3 | 11 |
4 | 100 |
5 | 101 |
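The counting rule in the table can be sketched in C by repeatedly dividing by 2 and collecting the remainders. This is only an illustrative sketch; the function name `to_binary` and the fixed 32 bit input width are assumptions made for the example, not something from the earlier articles.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the base 2 digits of `value` into `out` as a NUL-terminated string.
 * `out` must have room for at least 33 bytes (32 digits plus the terminator).
 * Returns `out` so the call can be passed straight to printf or puts. */
char *to_binary(uint32_t value, char *out)
{
    char digits[32];
    size_t count = 0;

    /* Peel off the least significant digit until nothing is left. */
    do {
        digits[count++] = (char)('0' + (value % 2));
        value /= 2;
    } while (value > 0);

    /* The digits were collected in reverse order, so copy them back to front. */
    size_t i = 0;
    while (count > 0) {
        out[i++] = digits[--count];
    }
    out[i] = '\0';
    return out;
}
```

With a 33 byte buffer `buf`, `to_binary(5, buf)` fills it with the string 101, the same digits as the last row of the table.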
The basic size of a number used by computers is 8 binary digits, or bits, also known as a byte. A byte has 256 values: 0 to 255, or -128 to 127 if we use signed values. When bigger numbers are required the number of bits is doubled, increasing to 16 bits, 32 bits, 64 bits and so on. Once we have 16 bits or more there’s a quirk between different computer architectures known as endianness, which is the order in which the bytes of a number are stored. There are two common endian systems, big and little. A big endian system orders the bytes the way you’d expect, starting with the most significant byte and going down to the least significant byte. A little endian system reverses this order.
Bits | Value (binary) | Little Endian | Big Endian |
---|---|---|---|
8 | 00000000 | 00000000 | 00000000 |
16 | 11111111 00000000 | 00000000 11111111 | 11111111 00000000 |
32 | 00000000 10000001 00000000 10011001 | 10011001 00000000 10000001 00000000 | 00000000 10000001 00000000 10011001 |
Within a single computer endianness is transparent to most programs and doesn’t affect anything. But in a world connected by the internet, sharing binary data between different systems makes this a very important issue (the two bytes that mean 256 on one system mean 1 on a system with the opposite byte order). This might not seem to have much to do with terminal interfaces, but it gives insight into why text formatted data is still, to this day, a hugely popular format preferred over raw binary data.
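To make the byte order problem concrete, here is a small sketch in C that interprets the same two bytes in both orders. The function names are made up for this example. The shifts are written out by hand so the result does not depend on the byte order of the machine running the code.

```c
#include <stdint.h>

/* Interpret two bytes as a 16 bit value, most significant byte first. */
uint16_t read_u16_big_endian(const unsigned char bytes[2])
{
    return (uint16_t)((bytes[0] << 8) | bytes[1]);
}

/* Interpret the same two bytes, least significant byte first. */
uint16_t read_u16_little_endian(const unsigned char bytes[2])
{
    return (uint16_t)((bytes[1] << 8) | bytes[0]);
}
```

Given the bytes 0x01 0x00, the big endian reader returns 256 while the little endian reader returns 1. Neither answer is wrong on its own; the two sides simply have to agree on the order before exchanging raw binary data.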
Data as Text
There’s nothing more embarrassing than a teacher intercepting a note you passed to a friend in class and reading it to everyone. So we learn to create secret key codes with our friends to prevent other people from reading our notes. The simplest way to do this is to substitute letters with numbers, like A=1, B=2, C=3 and so on. For a computer, this type of code couldn’t keep a secret for long, but it is used as a way to format information, known as an encoding. The most common character to number key code is known as the ASCII encoding.
With the ASCII encoding system a character is stored in one byte. ASCII only uses 7 of the 8 bits, so it defines 128 characters. It covers the basic Latin letters, digits and punctuation, but if we want to support other languages like Russian or Chinese a different encoding is needed. Many of these encodings keep the 128 ASCII characters and use the other 128 byte values for additional characters. Each such encoding is known as a code page. The shortcoming of this system is that only a single language or family of languages can be supported at any given time.
Enter the Unicode standard and the UTF-8 format. Unicode maps each character to a number called a code point, which is usually handled as a 32 bit value; the standard has room for just over 1.1 million code points. As described previously, 32 bit values are subject to endian issues. The UTF-8 format solves this by storing values up to 127 in a single byte and splitting larger values across a sequence of two to four bytes that is always written in the same order, eliminating the endian issues [2]. Almost every conceivable language can be supported and easily shared across all systems.
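The way UTF-8 splits a character number into bytes can be sketched in C as follows. This is a minimal illustration rather than a complete library; the function name `utf8_encode` is an assumption for the example, and it encodes one number at a time without any of the validation a real text library would add around it.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into `out` (room for 4 bytes).
 * Returns the number of bytes written, or 0 if the value is not encodable. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 1 byte: identical to ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {   /* reserved range, not characters */
        return 0;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the last Unicode code point */
}
```

The letter A (65) comes out as the single byte 0x41, exactly as in ASCII, while the euro sign (0x20AC) becomes the three bytes 0xE2 0x82 0xAC. Because the output is a sequence of single bytes in a fixed order, every system reads it the same way.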
If we look back at the hello program example in the previous article, Command Line TTY Devices, the program can be rewritten to display “Hello World” by sending number codes to standard out. When we compile these example programs, this is exactly what the compiler does to the quoted text in the program: it converts it to encoded numbers.
```c
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    // Write "Hello World" to STDOUT using the ASCII number encodings.
    putchar(72);  // H
    putchar(101); // e
    putchar(108); // l
    putchar(108); // l
    putchar(111); // o
    putchar(32);  // space
    putchar(87);  // W
    putchar(111); // o
    putchar(114); // r
    putchar(108); // l
    putchar(100); // d
    putchar(10);  // newline

    return EXIT_SUCCESS; // Return success
}
```
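Going the other way is just as easy to sketch. The helper below, whose name is made up for this example, copies the number stored for each character of a string into an array. It assumes the compiler uses an ASCII compatible encoding for quoted text, which is the case on virtually every current system.

```c
#include <stddef.h>

/* Copy the numeric code of each character in `text` into `codes`.
 * Stops at the end of the string or after `max` codes, whichever comes first.
 * Returns the number of codes written. */
size_t text_to_codes(const char *text, int *codes, size_t max)
{
    size_t n = 0;
    while (n < max && text[n] != '\0') {
        /* A char is already a small number; the cast keeps it non-negative. */
        codes[n] = (unsigned char)text[n];
        n++;
    }
    return n;
}
```

Calling it on the string "Hello World\n" yields 12 codes starting with 72, 101, 108, the same numbers passed to putchar in the program above.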
The Terminal Interface
Terminals are text encoded interfaces. They can use encodings like ASCII, code pages or Unicode. Typically a new terminal window starts with an 80×25 grid of characters and a cursor. Each character can have its own colours and can be bold or underlined. On the surface terminals seem extremely simple, and in many ways they are, but they can have a seemingly endless number of configurable options. If you look at the manual for xterm (man xterm), it goes on and on with no end in sight. Terminal software can have a lot of features, and they vary from one terminal program to another.
I’m not going to go into all the features in this article, but an important one that all terminals support is escape codes. We’ve seen how we can display information on the terminal by writing to standard out. For a program to use the features of the terminal, like colouring text or moving the cursor around, the ASCII escape character (decimal 27, written \033 in octal) is written to standard out, followed by a number of other codes describing the desired effect.
Here’s a simple example using escape codes in the bash shell and an xterm compatible terminal.
```shell
echo -e "Using escape codes: \033[01;32mI'm in color\033[00m"
```
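The same sequence can be produced from a C program, since escape codes are just ordinary bytes. The sketch below builds the string in a buffer; the function name `colourize` is hypothetical, and the colour numbers (32 for green) are the ones xterm compatible terminals understand, so other terminals may behave differently.

```c
#include <stdio.h>

/* Wrap `text` in the escape sequence for bold text in foreground colour
 * `colour` (for example 32 for green), followed by the reset sequence.
 * The result is written to `out`, truncated to `size` bytes if needed.
 * Returns what snprintf returns: the length the full string requires. */
int colourize(char *out, size_t size, int colour, const char *text)
{
    /* \033 is the octal form of the escape character (decimal 27). */
    return snprintf(out, size, "\033[01;%dm%s\033[00m", colour, text);
}
```

After `colourize(buf, sizeof buf, 32, "I'm in color")`, writing `buf` to standard out with `fputs(buf, stdout)` should show the text in bold green on an xterm compatible terminal, just like the echo command.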
There are a lot of common escape codes supported by the different terminal software, but each terminal also has its own unique codes. In later articles, when we start looking at shell scripting, I’ll show how you can use programs like tput to take advantage of the terminal features without having to worry about the differences in escape codes.
Now that we have trudged through all these different components and systems, we can move on to the heart of it all: the “Command Line Shells”…
Footnotes:
[1] Internally, binary is represented as a circuit that is either turned on or off. It can also be a capacitor being charged or discharged (similar to how RAM works) or a spot on magnetic storage being magnetized or not (like hard drives or backup tapes).
[2] The first 128 characters in UTF-8 are backwards compatible with the ASCII encoding, making most text files compatible with both encodings.