Steganography with Zero Width Characters (Z-Chars)
Steganography is the practice of hiding a message in something that is not secret. Certain Unicode characters are zero-width, meaning that they won’t be printed and can’t be seen. Using these characters we can encode other characters and hide messages.
POLITE NOTICE: There’s enough trash on the web without hidden characters gumming up our messages. Please don’t use this in the wild.
Hiding characters in text
Wherever text is represented digitally, it’s likely to be encoded using Unicode. Some Unicode characters are intentionally invisible: a carriage return (↵), for example, can’t be seen, although we can see its effect. Such characters are known as control characters.
Because not all languages read from left to right, some require special mark-up to ensure they run from right to left. Take the word “Egypt” as an example.
Computers need to know which direction these characters run in and one way to indicate this is with control characters.
There are nine such characters as specified in the Unicode Bidirectional Algorithm. The nine characters are:
U+2066 LEFT-TO-RIGHT ISOLATE
U+2067 RIGHT-TO-LEFT ISOLATE
U+2068 FIRST-STRONG ISOLATE
U+202A LEFT-TO-RIGHT EMBEDDING
U+202B RIGHT-TO-LEFT EMBEDDING
U+202D LEFT-TO-RIGHT OVERRIDE
U+202E RIGHT-TO-LEFT OVERRIDE
U+202C POP DIRECTIONAL FORMATTING
U+2069 POP DIRECTIONAL ISOLATE
I’m not going to be digging much into the details but hopefully the names alone provide some indication of their use. Let’s explore how they’ll print in the browser. We’ll be using the UTF-16 reference listed above. In JS we can access these characters with '\uCODE'. The letter “A”, for example, is:
'\u0041' // "A"
'\u0041\u0042\u0043' // "ABC"
'\u2066\u2067\u2068\u202A\u202B\u202D\u202E\u202C\u2069' // "" [Nothing is visible]
As you can see, our nine characters print as what looks like an empty string “”, even though the string still contains all nine of them. Now let’s try to insert these characters between other letters.
'A' + '\u2066\u202A\u202D' + 'B' + '\u2066\u202A\u202D' + 'C' // "ABC"
Again the control characters between the letters can’t be seen.
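To convince ourselves the characters really are there, here is a small sketch that compares the padded string with plain "ABC". The names BIDI_CONTROLS, plain, padded, found and stripped are mine, chosen for illustration; the regular expression simply covers the nine code points listed earlier.

```javascript
// The nine bidirectional control characters listed above:
// U+2066..U+2069 (4 characters) and U+202A..U+202E (5 characters).
const BIDI_CONTROLS = /[\u2066-\u2069\u202A-\u202E]/g;

const plain = 'ABC';
const padded = 'A' + '\u2066\u202A\u202D' + 'B' + '\u2066\u202A\u202D' + 'C';

console.log(plain.length);     // 3
console.log(padded.length);    // 9
console.log(plain === padded); // false

// Count the invisible characters, then strip them out again.
const found = padded.match(BIDI_CONTROLS) || [];
const stripped = padded.replace(BIDI_CONTROLS, '');

console.log(found.length);       // 6
console.log(stripped === plain); // true
```

So the two strings look the same on screen but are not equal, which is exactly what makes the trick work (and what makes it easy to detect if you go looking).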
The assumed use for this is hiding characters in the Latin alphabet, so we know the text will always run left to right. To guarantee this, it’s safest to limit ourselves to the three left-to-right control characters:
U+2066 LEFT-TO-RIGHT ISOLATE
U+202A LEFT-TO-RIGHT EMBEDDING
U+202D LEFT-TO-RIGHT OVERRIDE
We now have the means to hide characters; all we need now is a way to encode other characters with them.
Encoding and Decoding
So we have our three characters with which we can encode our hidden letters. This is great because, if you have an understanding of binary, you’ll know that we only need two: 0 and 1. Let’s see how we can use these two characters to represent the letter “X”.
"X".charCodeAt(0) // 88
"X".charCodeAt(0).toString(2); // "1011000"
charCodeAt gives us the UTF-16 code unit for our letter (for a character like “X” this is the same as its Unicode code point), and toString(2) then converts this to a binary string, that is, the number written in base 2. To get back to our “X” we can use parseInt and fromCodePoint to reverse the process.
parseInt("1011000",2) // 88
String.fromCodePoint(88) // "X"
String.fromCodePoint(parseInt("1011000",2)) // "X"
Because we know we can represent “X” using 0 and 1, it’s a short step to swapping these characters out for the control characters that can’t be seen.
1 => U+2066
0 => U+202A
1 => U+2066
1 => U+2066
0 => U+202A
0 => U+202A
0 => U+202A
So an X hidden between A and B might look like this:
'A'+'\u2066\u202A\u2066\u2066\u202A\u202A\u202A'+'B' // AB
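The swap can be written as a pair of small functions. This is a minimal sketch rather than the project’s actual source: encodeBinary, decodeBinary, ONE and ZERO are names I’ve made up here, and it assumes the letter being hidden is a single UTF-16 code unit (true for anything in the Latin alphabet).

```javascript
const ONE = '\u2066';  // LEFT-TO-RIGHT ISOLATE stands in for 1
const ZERO = '\u202A'; // LEFT-TO-RIGHT EMBEDDING stands in for 0

// Turn one letter into a run of invisible characters.
function encodeBinary(letter) {
  return letter
    .charCodeAt(0)  // "X" -> 88
    .toString(2)    // 88 -> "1011000"
    .split('')
    .map((bit) => (bit === '1' ? ONE : ZERO))
    .join('');
}

// Turn a run of invisible characters back into the letter.
function decodeBinary(hidden) {
  const bits = hidden
    .split('')
    .map((ch) => (ch === ONE ? '1' : '0'))
    .join('');
  return String.fromCodePoint(parseInt(bits, 2));
}

const hiddenX = encodeBinary('X');
console.log(hiddenX === '\u2066\u202A\u2066\u2066\u202A\u202A\u202A'); // true
console.log(decodeBinary(hiddenX)); // "X"
```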
Because we have three characters, however, it makes sense to use them all. This changes our base (or radix) to three. Encoding and decoding with three characters looks like this:
// ENCODING
"X".charCodeAt(0) // 88
"X".charCodeAt(0).toString(3); // "10021" [Comprises three characters 0,1 and 2]
// DECODING
parseInt("10021",3) // 88
String.fromCodePoint(88) // "X"
String.fromCodePoint(parseInt("10021",3)) // "X"
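Here is the same idea with all three control characters in play. It’s a sketch under a couple of assumptions: the names TERNARY_DIGITS, encodeTernary and decodeTernary are hypothetical, and so is the choice of which control character stands for which digit (I’ve kept 0 and 1 as in the binary example and given 2 to U+202D).

```javascript
// Index in the array = ternary digit. The assignment is arbitrary;
// encoder and decoder just have to agree on it.
const TERNARY_DIGITS = ['\u202A', '\u2066', '\u202D']; // 0, 1, 2

function encodeTernary(letter) {
  return letter
    .charCodeAt(0)  // "X" -> 88
    .toString(3)    // 88 -> "10021"
    .split('')
    .map((digit) => TERNARY_DIGITS[Number(digit)])
    .join('');
}

function decodeTernary(hidden) {
  const digits = hidden
    .split('')
    .map((ch) => TERNARY_DIGITS.indexOf(ch))
    .join('');
  return String.fromCodePoint(parseInt(digits, 3));
}

const ternaryX = encodeTernary('X');
console.log(ternaryX.length);         // 5
console.log(decodeTernary(ternaryX)); // "X"
```

The payoff is size: “X” needs five hidden characters in base 3 against seven in base 2.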
Let’s take another example by hiding “HIDDEN” within the word “VISIBLE”. Because our z-chars fit between our visible letters, with one hidden character per gap, the number of characters we can hide is the number of visible letters minus one. So our visible letters demarcate our hidden ones.
Here’s a visualisation of the encoding:
Other than a few functions to split strings and interpolate the hidden characters, this all but covers the core concepts behind Z-Chars. Pray you never use them to this effect.
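For the curious, those splitting and interpolating functions might look something like the sketch below. To be clear, this is my own illustration and not the project’s source: hide, reveal, encodeChar, decodeRun and Z_DIGITS are hypothetical names, the digit assignment is arbitrary, and it assumes the visible text contains no z-chars of its own and that every hidden character fits in one UTF-16 code unit.

```javascript
const Z_DIGITS = ['\u202A', '\u2066', '\u202D']; // ternary digits 0, 1, 2

// One hidden character -> a run of z-chars (base 3).
function encodeChar(ch) {
  return ch.charCodeAt(0).toString(3).split('')
    .map((digit) => Z_DIGITS[Number(digit)])
    .join('');
}

// A run of z-chars -> the hidden character.
function decodeRun(run) {
  const digits = run.split('').map((z) => Z_DIGITS.indexOf(z)).join('');
  return String.fromCodePoint(parseInt(digits, 3));
}

// Place one encoded character in each gap between visible letters.
function hide(visible, secret) {
  if (secret.length > visible.length - 1) {
    throw new RangeError('Not enough gaps between visible letters');
  }
  let out = '';
  for (let i = 0; i < visible.length; i++) {
    out += visible[i];
    if (i < secret.length) out += encodeChar(secret[i]);
  }
  return out;
}

// Each unbroken run of z-chars is one hidden character;
// the visible letters act as the separators.
function reveal(text) {
  const runs = text.match(/[\u202A\u2066\u202D]+/g) || [];
  return runs.map(decodeRun).join('');
}

const stego = hide('VISIBLE', 'HIDDEN');
console.log(stego.replace(/[\u202A\u2066\u202D]/g, '')); // "VISIBLE"
console.log(reveal(stego));                              // "HIDDEN"
```

Because each gap carries exactly one hidden character, no extra separator is needed: the next visible letter marks where one run ends and the next begins.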
Please take a look through the source code if you’re curiouser still.