Fairly simple requirement - get the first letter of a profile description and present it together with a link. You get the idea: if I have two profiles, Foo and Bar, I want two links with F and B respectively.
The first version of the code (not even mentioned here) was just something like: if string has at least one character, take uppercase of the first character.
This seemingly simple approach completely ignores Emojis, many of which are stored in C# strings as two consecutive chars (a surrogate pair). The second version of the code was then:
public static string ToShortDescription( this string source )
{
    var description = source?.Trim();
    if ( !string.IsNullOrWhiteSpace( description ) && description.Length >= 1 )
    {
        if ( char.IsSurrogatePair( description, 0 ) )
        {
            return description.Substring( 0, 2 ).ToUpper();
        }
        else
        {
            return description.Substring( 0, 1 ).ToUpper();
        }
    }
    return "?";
}
This is better, much better. It's not just the first character of the string, it's a substring of length 2 whenever the string starts with a surrogate pair. This simple approach correctly handles many two-char Emojis, like the mage Emoji, 🧙 (code point U+1F9D9), stored as the two chars 0xD83E 0xDDD9.
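To make the two-char case concrete, here is a minimal sketch. The class and method names are mine and purely illustrative; the only framework call it relies on is char.IsSurrogatePair, the same one used above.

```csharp
using System;

public static class SurrogateDemo
{
    // Hypothetical helper (not part of the original code):
    // how many UTF-16 chars does the first code point of the string occupy?
    public static int FirstCodePointLength( string s )
    {
        if ( string.IsNullOrEmpty( s ) )
        {
            return 0;
        }

        // IsSurrogatePair( s, 0 ) returns false when there is no second char to look at
        return char.IsSurrogatePair( s, 0 ) ? 2 : 1;
    }
}
```

With this, SurrogateDemo.FirstCodePointLength( "Foo" ) gives 1 while SurrogateDemo.FirstCodePointLength( "\U0001F9D9" ) (the mage) gives 2. I write the Emoji as an escape sequence here so that the example does not depend on the encoding of the source file.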
Unfortunately, it's just the beginning of the story. It turns out some Emojis are composed of other Emojis. Let's take the mage Emoji. Its female version, 🧙‍♀️, is encoded as the base mage Emoji followed by additional characters that indicate the female version: a Zero-Width Joiner (ZWJ, U+200D), the female sign ♀ (U+2640) and a variation selector (U+FE0F). The ZWJ is the special character used to glue Emojis together.
Take a C# string that starts with the female mage Emoji. This time it's not 2 characters that should be taken from it, now it's 5! The two-char mage Emoji, the ZWJ, the female sign and the variation selector!
Let this sink in - in order to have a single visible character on the screen, we need to take the first 5 characters of the C# string!
And as you can expect, the above version of code correctly discovers the first surrogate but fails to discover the ZWJ.
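To see where the 5 comes from, the chars of the string can simply be listed. The helper below is a hypothetical sketch of mine (the names CodeUnitDump and Dump do not appear anywhere in the real code); it only formats every char of a string as a four-digit hex number.

```csharp
using System;
using System.Linq;

public static class CodeUnitDump
{
    // Hypothetical helper: formats each UTF-16 char of the string as 4 hex digits
    public static string Dump( string s )
    {
        return string.Join( " ", s.Select( c => ( (int)c ).ToString( "X4" ) ) );
    }
}
```

For the female mage, written as "\U0001F9D9\u200D\u2640\uFE0F", CodeUnitDump.Dump returns "D83E DDD9 200D 2640 FE0F": the surrogate pair of the mage, the ZWJ, the female sign and the variation selector. The second version of ToShortDescription stops after the first two of these.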
There's even a discussion on Stack Overflow on how to detect such sequences.
My current approach is:
public static string ToShortDescription( this string source, bool autoUpper = true )
{
    var description = source?.Trim();
    if ( !string.IsNullOrWhiteSpace( description ) )
    {
        // take successive chars according to the following rules:
        // * ordinary char  - take it and stop
        // * ZWJ            - take it (with the two chars after it) and continue
        // * surrogate pair - take both chars and continue

        // iterate over the trimmed description (not the raw source),
        // otherwise leading whitespace would be returned as the result
        char[] sourceChars = description.ToCharArray();
        List<char> destChars = new List<char>(); // requires using System.Collections.Generic;
        var index = 0;
        bool takeAgain;

        do
        {
            takeAgain = false;

            // is there a char with another one right after it (two-char sequence)?
            if ( index < sourceChars.Length - 1 )
            {
                // surrogate pair
                if ( char.IsSurrogatePair( sourceChars[index], sourceChars[index + 1] ) )
                {
                    destChars.AddRange( new[] { sourceChars[index], sourceChars[index + 1] } );
                    index += 2;
                    takeAgain = true;
                }
            }

            if ( index < sourceChars.Length - 2 )
            {
                // ZWJ (U+200D) - glues two emoji together
                if ( sourceChars[index] == (char)8205 )
                {
                    destChars.AddRange( new[] { sourceChars[index], sourceChars[index + 1], sourceChars[index + 2] } );
                    index += 3;
                    takeAgain = true;
                }
            }
        } while ( takeAgain && index < sourceChars.Length );

        // take a single char if nothing has been taken yet
        if ( !takeAgain && index <= sourceChars.Length - 1 && destChars.Count == 0 )
        {
            destChars.Add( sourceChars[index] );
        }

        string _result = new string( destChars.ToArray() );
        return autoUpper ? _result.ToUpper() : _result;
    }

    return "?";
}
This passes some important unit tests. Namely, it correctly handles the England flag Emoji, 🏴󠁧󠁢󠁥󠁮󠁧󠁿, which is the black flag (🏴) followed by six invisible tag characters, each of them stored as a surrogate pair. It is still a single visible sign, but in this extreme case it's the first 14 characters of the C# string! I believe there's still room for improvement (and possibly other strange cases that I still miss).
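One possible direction for that improvement, which is not what the code above does, is to let the framework find the boundary of the first visible sign. The sketch below assumes .NET 5 or later, where System.Globalization.StringInfo splits strings into extended grapheme clusters; on older runtimes the same call splits Emoji sequences differently. The names GraphemeSketch and FirstTextElement are mine.

```csharp
using System;
using System.Globalization;

public static class GraphemeSketch
{
    // Hypothetical alternative to ToShortDescription:
    // returns the first text element of the trimmed input, or "?" if there is none.
    // Assumption: .NET 5+, where a text element is an extended grapheme cluster.
    public static string FirstTextElement( string source, bool autoUpper = true )
    {
        var description = source?.Trim();
        if ( string.IsNullOrEmpty( description ) )
        {
            return "?";
        }

        // the text element starting at index 0: a single char, a surrogate pair,
        // or a longer sequence (ZWJ sequence, flag with tag characters, ...)
        string first = StringInfo.GetNextTextElement( description, 0 );

        return autoUpper ? first.ToUpperInvariant() : first;
    }
}
```

Under that assumption I would expect GraphemeSketch.FirstTextElement( "bar" ) to return "B", the female mage to come back as all 5 chars and the England flag as all 14, without any hand-written ZWJ handling. It is worth running it against the same unit tests before trusting it.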