Messed Up Characters In Webpages (especially Social Media)
Solution 1:
How is such a thing possible?
Unicode allows diacritical marks to be used in two ways.
The first is ‘composed’ form, where there is a single character for combined letter and diacritical, for example U+00E9 Latin Small Letter E with Acute é.
The second is ‘decomposed’ form, where you have a character for the base letter and then a separate ‘combining diacritical’ character after it. The text processor and/or font renders the combination of these characters as one grapheme, for example U+0065 Latin Small Letter E followed by U+0301 Combining Acute Accent é. The advantage (and arguably disadvantage) of this is that you can write combinations that have no precomposed character (typically because they were never used in any real language), such as x́.
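To make the two forms concrete, here is a small Python sketch using the standard `unicodedata` module. The variable names are just for illustration; NFC and NFD are the Unicode normalization forms that convert to composed and decomposed form respectively.

```python
import unicodedata

composed = "\u00e9"     # one code point: Latin Small Letter E with Acute
decomposed = "e\u0301"  # base letter followed by Combining Acute Accent

# The two strings render alike but are not equal, and differ in length.
same = composed == decomposed
lengths = (len(composed), len(decomposed))

# Normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # split into base + combiner

# x with acute has no precomposed character, so NFC leaves it as two code points.
x_acute = unicodedata.normalize("NFC", "x\u0301")

print(same, lengths, len(x_acute))
```

Because the strings differ at the code point level, comparing or length-checking user input is usually more predictable after normalizing it to one form first.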
Multiple combining diacriticals may be used on a single letter, as there are languages that use more than one accent on a letter (combining characters are also used for other tricks, like Korean Jamo and Tibetan joined letters). There is no inherent limit to how many combining characters may be used to make a single grapheme.
Many text processors will try to lay out multiple combining diacriticals by piling them up on top of each other (and in the other direction, for ‘below’ accents). In general this is a reasonable way to attempt to show a multiply-accented letter that the font in use doesn't have a specific glyph for. But it does mean you can go crazy and use absurd numbers of diacriticals to decorate way outside the normal text line.
How can we prevent things like that from happening in our website?
A simple solution would be to put each comment in its own block with CSS overflow: hidden, so that the stacked marks can't spill over into other content.
Another possibility is to filter input for sequences of multiple combining characters. For example with regex you could remove:
\p{M}{9,}
as 8 is the longest sequence of combiners known in a natural language at present. You could try a lower limit if you only care about simple alphabets. This needs a regex engine with support for Unicode character classes (\p), which some languages don't natively have. If your language lacks this but you do have access to the Unicode database (e.g. unicodedata in Python), you could manually walk over the characters, looking for those whose general category starts with M (Mark).
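The manual walk could look something like the following. This is a minimal sketch rather than a hardened filter: the function name `strip_long_mark_runs` and the `max_marks` parameter are made up for illustration, and the default of 8 simply reuses the limit mentioned above. It drops any run of more than `max_marks` consecutive combining characters, which is the same effect as replacing `\p{M}{9,}` with an empty string.

```python
import unicodedata
from itertools import groupby


def is_mark(ch: str) -> bool:
    """True if the character's general category is Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def strip_long_mark_runs(text: str, max_marks: int = 8) -> str:
    """Remove every run of more than max_marks consecutive mark characters."""
    out = []
    # groupby splits the text into alternating runs of marks and non-marks.
    for marks, group in groupby(text, key=is_mark):
        run = "".join(group)
        if marks and len(run) > max_marks:
            continue  # drop the whole oversized run
        out.append(run)
    return "".join(out)


zalgo = "a" + "\u0301" * 20 + "b"
cleaned = strip_long_mark_runs(zalgo)
```

Ordinary accented text passes through untouched, since a run at or below the limit is kept as is. If you would rather keep the first few marks of an oversized run than delete all of them, append `run[:max_marks]` instead of skipping.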