A fairly simple requirement: get the first letter of a profile description and present it together with a link. You get the idea: if I have two profiles, Foo and Bar, I want two links labeled F and B respectively.
The first version of the code (not even shown here) was just something like: if the string has at least one character, take the uppercase of the first character.
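Since that first version isn't shown, here is a minimal sketch of what it might have looked like. The class and method names (`NaiveShortDescription`, `ToNaiveShortDescription`) are hypothetical, and the `"?"` fallback is borrowed from the later versions.

```csharp
using System;

public static class NaiveShortDescription
{
    // Hypothetical reconstruction of the first version: first char, uppercased.
    public static string ToNaiveShortDescription( this string source )
    {
        var description = source?.Trim();
        if ( string.IsNullOrEmpty( description ) )
        {
            return "?";
        }
        // Takes exactly one UTF-16 code unit. For an emoji stored as a
        // surrogate pair this is only the first half of the character.
        return description.Substring( 0, 1 ).ToUpper();
    }
}
```

Calling `"bar".ToNaiveShortDescription()` gives `"B"`, but for a description that starts with an emoji outside the Basic Multilingual Plane the result is a lone high surrogate, which cannot be rendered on its own.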
This seemingly simple approach completely ignores emojis, many of which are stored in C# strings as two consecutive chars (a UTF-16 surrogate pair). The second version of the code was then:
    public static string ToShortDescription( this string source )
    {
        var description = source?.Trim();
        if ( !string.IsNullOrWhiteSpace( description ) && description.Length >= 1 )
        {
            if ( char.IsSurrogatePair( description, 0 ) )
            {
                return description.Substring( 0, 2 ).ToUpper();
            }
            else
            {
                return description.Substring( 0, 1 ).ToUpper();
            }
        }
        return "?";
    }
This is better, much better. It's no longer just the first char of the string; when needed, it's a substring of length 2. This simple approach correctly handles many two-char emojis, like the mage emoji, 🧙, encoded as the surrogate pair U+D83E U+DDD9 (code point U+1F9D9).
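To make the "two chars" claim concrete, here is a small sketch that inspects the mage emoji with standard `char` helpers. The class and member names (`SurrogateDemo`, `Mage`, `CodePointAt`) are made up for illustration.

```csharp
using System;

public static class SurrogateDemo
{
    // The mage emoji, U+1F9D9, written as its two UTF-16 code units.
    public const string Mage = "\uD83E\uDDD9";

    // True when the chars at index and index + 1 form a surrogate pair.
    public static bool StartsPairAt( string s, int index ) =>
        index < s.Length - 1 && char.IsSurrogatePair( s, index );

    // Code point that starts at the given index (combines a surrogate pair).
    public static int CodePointAt( string s, int index ) =>
        char.ConvertToUtf32( s, index );
}
```

`SurrogateDemo.Mage.Length` is 2 even though a single glyph is displayed, and `SurrogateDemo.CodePointAt( SurrogateDemo.Mage, 0 )` yields `0x1F9D9`.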
Unfortunately, that's just the beginning of the story. It turns out some emojis are composed of other emojis. Take the mage emoji. Its female version, 🧙‍♀️, is encoded as the mage emoji followed by additional characters that indicate the female variant. The special character used to glue emojis together is the Zero-Width Joiner (ZWJ, U+200D).
Take a C# string that starts with the female mage emoji. This time it's not 2 chars that should be taken from it, it's 5: the two-char mage emoji, the ZWJ, the female sign (U+2640, a single char) and a variation selector (U+FE0F).
Let this sink in: to get a single visible character on the screen, we need to take the first 5 chars of the C# string!
And as you can expect, the version above correctly detects the leading surrogate pair but fails to detect the ZWJ.
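The five chars can be listed with a tiny helper. This is only an illustrative sketch; `CodeUnitDump`, `FemaleMage` and `Dump` are hypothetical names.

```csharp
using System;
using System.Linq;

public static class CodeUnitDump
{
    // Female mage: mage (surrogate pair) + ZWJ + female sign + variation selector-16.
    public const string FemaleMage = "\uD83E\uDDD9\u200D\u2640\uFE0F";

    // Formats every UTF-16 code unit of the string as four hex digits.
    public static string Dump( string s ) =>
        string.Join( " ", s.Select( c => ((int)c).ToString( "X4" ) ) );
}
```

`CodeUnitDump.Dump( CodeUnitDump.FemaleMage )` returns `D83E DDD9 200D 2640 FE0F`. A `Substring( 0, 2 )` keeps only `D83E DDD9`, so the link would show the plain mage instead of the female one.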
My current approach is:
    // requires: using System.Collections.Generic; (for List<char>)
    public static string ToShortDescription( this string source, bool autoUpper = true )
    {
        var description = source?.Trim();
        if ( !string.IsNullOrWhiteSpace( description ) && description.Length >= 1 )
        {
            // take consecutive chars according to these rules:
            // * ordinary char - take it and stop
            // * ZWJ - take it and keep going
            // * surrogate pair - take both chars and keep going
            // work on the trimmed text so leading whitespace is never picked
            char[] sourceChars = description.ToCharArray();
            List<char> destChars = new List<char>();
            var index = 0;
            bool takeAgain;
            do
            {
                takeAgain = false;
                // is there a char here and one more after it (a two-char unit)?
                if ( index < sourceChars.Length - 1 )
                {
                    // surrogate pair
                    if ( char.IsSurrogatePair( sourceChars[index], sourceChars[index + 1] ) )
                    {
                        destChars.AddRange( new[] { sourceChars[index], sourceChars[index + 1] } );
                        index += 2;
                        takeAgain = true;
                    }
                }
                if ( index < sourceChars.Length - 2 )
                {
                    // ZWJ (U+200D) glues two emojis: take it plus the two chars after it
                    if ( sourceChars[index] == (char)8205 )
                    {
                        destChars.AddRange( new[] { sourceChars[index], sourceChars[index + 1], sourceChars[index + 2] } );
                        index += 3;
                        takeAgain = true;
                    }
                }
            } while ( takeAgain && index < sourceChars.Length );
            // nothing taken yet: the text starts with an ordinary char, so take just that one
            if ( !takeAgain &&
                 index <= sourceChars.Length - 1 &&
                 destChars.Count == 0 )
            {
                destChars.Add( sourceChars[index] );
            }
            string _result = new string( destChars.ToArray() );
            return autoUpper ? _result.ToUpper() : _result;
        }
        return "?";
    }
This passes some important unit tests. Namely, it correctly handles the flag of England emoji, 🏴󠁧󠁢󠁥󠁮󠁧󠁿, which is still a single visible sign, but in this extreme case it takes the first 14 chars of the C# string: the black flag (U+1F3F4) followed by six invisible tag characters, each stored as a surrogate pair. I believe there's still room for improvement (and possibly other strange cases that I still miss).
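One possible direction for that improvement, sketched here as an assumption rather than as the approach used above: let the framework find the first grapheme cluster with `StringInfo.GetNextTextElement`. On .NET 5 and later, text elements follow the Unicode extended grapheme cluster rules, so ZWJ sequences and tag sequences should come back as one element; on .NET Framework and .NET Core 3.1 the same call splits them, so the hand-written loop would still be needed there. The names `GraphemeShortDescription` and `ToShortDescriptionByTextElement` are hypothetical.

```csharp
using System;
using System.Globalization;

public static class GraphemeShortDescription
{
    // Hypothetical alternative, assuming .NET 5+ grapheme cluster behavior.
    public static string ToShortDescriptionByTextElement( this string source, bool autoUpper = true )
    {
        var description = source?.Trim();
        if ( string.IsNullOrEmpty( description ) )
        {
            return "?";
        }
        // First text element starting at index 0: a single char, a surrogate
        // pair, or (on .NET 5+) a longer cluster such as a ZWJ sequence.
        string first = StringInfo.GetNextTextElement( description, 0 );
        return autoUpper ? first.ToUpper() : first;
    }
}
```

Unlike a loop that keeps consuming every consecutive surrogate pair, this stops at the cluster boundary, so a description that starts with two separate emojis in a row should yield only the first one.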