You probably already know that when sending an SMS you have a relatively small number of characters available for each message, somewhere in the 140-160 character range. What you may not know (particularly if English is your native language) is that the character limit can be cut down to just 70 if you include Unicode characters in your message.
This problem has bitten us in two different apps over the past year or so. A sharing message that we had carefully crafted in English to fit well within a 140-character limit (for simple sharing via SMS, Twitter, or some other medium) was being chopped into two or even three separate messages in other languages.
Let’s ignore the additional issues that arise because many carriers handle text messages in their own, sometimes proprietary, way, and the further issues that come from those carriers trying to interoperate with each other. Instead, let’s focus on the SMS standard, which seems to apply in lots of situations. SMS was designed to have a data payload of 140 octets (140 8-bit bytes). So, depending on what character encoding you are using, you can squeeze 160 characters (the GSM 7-bit default alphabet, which covers most of ASCII but is not identical to it), 140 characters (8-bit data), or 70 characters (16-bit Unicode, UCS-2) into that payload. Pretty quickly, we can see why messages get split or truncated in languages that use characters outside the 7-bit alphabet. So, we might expect major issues in Russian or Japanese, but maybe not so much in French or Italian, since the 7-bit alphabet includes a handful of accented Latin letters such as é and è (though not all of them).
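The arithmetic behind those three limits is easy to check. Here is a minimal sketch in Python; the names `PAYLOAD_OCTETS` and `chars_per_message` are just illustrative, and it only covers a single, unconcatenated message.

```python
# A single SMS carries a payload of 140 octets, i.e. 140 * 8 = 1120 bits.
PAYLOAD_OCTETS = 140
PAYLOAD_BITS = PAYLOAD_OCTETS * 8


def chars_per_message(bits_per_char: int) -> int:
    """Return how many fixed-width characters fit in one SMS payload."""
    return PAYLOAD_BITS // bits_per_char


# The three encodings discussed above: 7-bit, 8-bit, and 16-bit.
for label, bits in (("7-bit", 7), ("8-bit", 8), ("16-bit", 16)):
    print(f"{label}: {chars_per_message(bits)} characters")
```

The payload never changes; only the number of bits each character costs does, which is why one encoding switch can cut the limit from 160 to 70.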
But even text that appears to be harmless 7-bit ASCII at first glance can cause issues. Just one Unicode character in a message will cause the entire message to be sent in Unicode, cutting the limit from 160 characters to 70. And that single Unicode character doesn’t have to be Russian, Japanese, Chinese, Greek, Tagalog or some other obvious standout. Something as simple as a curly apostrophe (’) instead of a straight apostrophe (') can break your messages and lead to much confusion and hair pulling. Even more difficult to diagnose is the non-breaking space (you know, good old &nbsp;). We have received translations with non-breaking space Unicode characters hiding innocently alongside normal space characters, basically impossible to detect without a specific script checking for that character.
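A checking script along those lines might look like the following sketch. It is not the script we used; the character tables are my own transcription of the GSM 7-bit default alphabet and its extension table (GSM 03.38), and the function name `find_unicode_triggers` is made up for this example, so verify the tables against the spec before relying on them.

```python
import unicodedata

# Assumption: hand-transcribed GSM 03.38 default alphabet (128 characters).
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞ\x1bÆæßÉ"
    " !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§"
    "¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Assumption: extension table characters (each costs two 7-bit slots).
GSM_EXTENSION = "^{}\\[~]|€\f"
GSM_CHARS = frozenset(GSM_BASIC + GSM_EXTENSION)


def find_unicode_triggers(text):
    """List (index, character, Unicode name) for every character that is
    outside the GSM 7-bit tables and would force a 16-bit message."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch not in GSM_CHARS
    ]


# Escapes are used here so the invisible characters are visible in source:
# \u2019 is the curly apostrophe, \u00a0 is the non-breaking space.
sample = "Don\u2019t miss\u00a0it"
for index, ch, name in find_unicode_triggers(sample):
    print(f"position {index}: U+{ord(ch):04X} {name}")
```

Printing the code point and Unicode name rather than the character itself matters here, because a non-breaking space looks exactly like a normal space on screen.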
So, the moral of the story is this: if you have strings that you are getting localized and that you know are supposed to be used in an SMS, make sure you know exactly which characters are going into the message so that you can make proper adjustments for character counts.