A fairly simple requirement: get the first letter of a profile description and present it as a link. You get the idea: if I have two profiles, Foo and Bar, I want two links, F and B respectively.
The first version of the code (not even shown here) was just something like: if the string has at least one character, take the uppercase of the first character.
This seemingly simple approach completely ignores Emojis, many of which are stored in C# strings as a surrogate pair, i.e. two consecutive chars. The second version of the code was then:
public static string ToShortDescription( this string source )
{
    var description = source?.Trim();
    if ( !string.IsNullOrWhiteSpace( description ) && description.Length >= 1 )
    {
        if ( char.IsSurrogatePair( description, 0 ) )
        {
            return description.Substring( 0, 2 ).ToUpper();
        }
        else
        {
            return description.Substring( 0, 1 ).ToUpper();
        }
    }
    return "?";
}
This is better, much better. It's no longer just the first character of the string: when the string starts with a surrogate pair, it's the substring of length 2. This simple approach correctly handles many two-char Emojis, like the mage Emoji, 🧙, encoded as the surrogate pair U+D83E U+DDD9 (code point U+1F9D9).
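To make the "two chars" claim concrete, here is a minimal sketch that looks at the UTF-16 code units directly. The class and method names (`SurrogateDemo`, `FirstCodePointLength`) are made up for illustration and are not part of the original code.

```csharp
using System;

public static class SurrogateDemo
{
    // The mage Emoji (code point U+1F9D9), written with escapes so that
    // the encoding of the source file does not matter.
    public const string Mage = "\uD83E\uDDD9";

    // Number of UTF-16 chars occupied by the first code point
    // of a non-empty string: 2 for a surrogate pair, 1 otherwise.
    public static int FirstCodePointLength( string s )
    {
        return char.IsSurrogatePair( s, 0 ) ? 2 : 1;
    }
}
```

With this, `SurrogateDemo.Mage.Length` is 2, `char.ConvertToUtf32( SurrogateDemo.Mage, 0 )` is 0x1F9D9, and `FirstCodePointLength( "Foo" )` is 1.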
Unfortunately, it's just the beginning of the story. It turns out some Emojis are combined from other Emojis. Let's take the mage Emoji. Its female version, 🧙‍♀️, is encoded as the base mage Emoji followed by additional characters that indicate the female version (U+1F9D9 U+200D U+2640 U+FE0F). The special character used to glue emojis together is the Zero-Width Joiner (ZWJ, U+200D).
Take a C# string that starts with the female mage Emoji. This time it's not 2 chars that should be taken from it, now it's 5! The two-char mage, the ZWJ, the female sign and a variation selector that asks for Emoji presentation!
Let this sink in - in order to have a single visible character on the screen, we need to take the first 5 chars of the C# string!
And as you can expect, the above version of the code correctly discovers the first surrogate pair but fails to discover the ZWJ.
There's even a discussion on Stack Overflow on how to detect this.
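The five chars are easier to believe when dumped one by one. The sketch below is only an illustration: `ZwjDemo`, `DumpChars` and `SurrogateOnlyPrefix` are hypothetical names, and `SurrogateOnlyPrefix` merely mirrors the surrogate-only check of the second version to show where it stops.

```csharp
using System;
using System.Linq;

public static class ZwjDemo
{
    // Female mage: mage (surrogate pair) + ZWJ + female sign + variation selector-16.
    public const string FemaleMage = "\uD83E\uDDD9\u200D\u2640\uFE0F";

    // Formats every UTF-16 char of the string as a 4-digit hex value.
    public static string DumpChars( string s )
    {
        return string.Join( " ", s.Select( c => ( (int)c ).ToString( "X4" ) ) );
    }

    // Same decision as the second version of the code:
    // take 2 chars for a leading surrogate pair, 1 char otherwise.
    public static string SurrogateOnlyPrefix( string s )
    {
        return char.IsSurrogatePair( s, 0 ) ? s.Substring( 0, 2 ) : s.Substring( 0, 1 );
    }
}
```

`ZwjDemo.DumpChars( ZwjDemo.FemaleMage )` gives `D83E DDD9 200D 2640 FE0F`, while `SurrogateOnlyPrefix` returns only the first two chars, so the ZWJ and everything after it is lost.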
My current approach is
// requires: using System.Collections.Generic;
public static string ToShortDescription( this string source, bool autoUpper = true )
{
    var description = source?.Trim();
    if ( !string.IsNullOrWhiteSpace( description ) && description.Length >= 1 )
    {
        // take consecutive chars according to the following rules:
        // * ordinary char - take it and stop
        // * ZWJ - take it (with the two chars that follow) and continue
        // * surrogate pair - take both chars and continue
        char[] sourceChars = description.ToCharArray();
        List<char> destChars = new List<char>();
        var index = 0;
        bool takeAgain;
        do
        {
            takeAgain = false;
            // is there a char with one more right after it (a two-char sequence)?
            if ( index < sourceChars.Length - 1 )
            {
                // surrogate pair
                if ( char.IsSurrogatePair( sourceChars[index], sourceChars[index + 1] ) )
                {
                    destChars.AddRange( new[] { sourceChars[index], sourceChars[index + 1] } );
                    index += 2;
                    takeAgain = true;
                }
            }
            if ( index < sourceChars.Length - 2 )
            {
                // ZWJ (U+200D) - glues two emojis together
                if ( sourceChars[index] == (char)8205 )
                {
                    destChars.AddRange( new[] { sourceChars[index], sourceChars[index + 1], sourceChars[index + 2] } );
                    index += 3;
                    takeAgain = true;
                }
            }
        } while ( takeAgain && index < sourceChars.Length );
        // nothing taken so far - the string starts with an ordinary char, take just that one
        if ( !takeAgain &&
             index <= sourceChars.Length - 1 &&
             destChars.Count == 0
           )
        {
            destChars.Add( sourceChars[index] );
        }
        string _result = new string( destChars.ToArray() );
        return autoUpper ? _result.ToUpper() : _result;
    }
    return "?";
}
This passes some important unit tests. Namely, it correctly handles the England flag Emoji (the black flag 🏴, U+1F3F4, followed by six invisible tag characters: U+E0067 U+E0062 U+E0065 U+E006E U+E0067 U+E007F), which is still a single visible sign, but in this extreme case it takes the first 14 chars of the C# string, because each of the 7 code points is stored as a surrogate pair! I believe there's still room for improvement (and possibly other strange cases that I still miss).
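One possible direction for that improvement, sketched here under assumptions rather than tested against the code above: on .NET 5 and later, `System.Globalization.StringInfo` splits strings into extended grapheme clusters, so the framework itself can find where the first visible sign ends. Older runtimes (.NET Framework, .NET Core 3.1 and earlier) use simpler rules and would cut ZWJ sequences short. `ShortDescriptionSketch` and `FirstTextElement` are hypothetical names.

```csharp
using System;
using System.Globalization;

public static class ShortDescriptionSketch
{
    // Hypothetical helper, not the method from this post.
    // Assumes .NET 5+, where StringInfo follows the Unicode
    // extended grapheme cluster rules.
    public static string FirstTextElement( string source, bool autoUpper = true )
    {
        var description = source?.Trim();
        if ( string.IsNullOrEmpty( description ) )
        {
            return "?";
        }
        // Take the first text element (grapheme cluster) starting at index 0.
        string element = StringInfo.GetNextTextElement( description, 0 );
        return autoUpper ? element.ToUpper() : element;
    }
}
```

Under that assumption, a description starting with the female mage should come back as all 5 chars, and one starting with the England flag as all 14, without any hand-written ZWJ or surrogate handling. The trade-off is that the result depends on the runtime version, so it would still need the same unit tests.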