

Removing Diacritics from Strings

Diacritics are additional decorations or glyphs that are added to letters or symbols to modify the pronunciation or meaning of a character. For some functions, such as text-based searches or string comparison, it is useful to ignore diacritical marks.

Diacritical Marks

Diacritical marks, sometimes simply called diacritics, are additional marks that can be applied to letters or other symbols to change the meaning or the pronunciation of a character. The additional marks, or glyphs, are usually applied directly above or below a symbol. Common diacritics include accents (á), cedillas (ç) and umlauts (ë).

Diacritical marks are essential to a character set to ensure that words that include them can be correctly spelled. However, they can make some functionality more complicated. For example, you might need to develop an application that permits a user to search for a person by name. If you ignore the possibility that names include diacritics, the search could miss potential matches; a user searching for "Diaz" might not find a person with the surname "Díaz", which has an accented "i".

One way to avoid this string matching problem is to remove diacritical marks from two strings that are to be compared. After the comparison, the string stripped of the additional glyphs can be discarded and the original version used. Shortly we'll see how this is possible using the .NET framework.

Unicode Normalisation

The .NET framework supports strings that are encoded using Unicode. This standard allows the alphabets of all languages to be represented, with a far wider range of character codes, or code points, than standards such as ASCII. However, the encoding used in Unicode means that symbols with the same meaning, and sometimes the same appearance, can be duplicated at several code points. This causes a problem when you want to compare strings.

To solve this problem, the Unicode standard includes rules for equivalence. To check if two characters have the same meaning or appearance, you can apply a process of normalisation to the strings that contain them. After normalisation, it is a simple matter to determine if two strings are equivalent.
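To illustrate, consider the word "café" constructed in two ways: once using the precomposed character "é" (U+00E9) and once using a plain "e" followed by the combining acute accent (U+0301). A simple equality check treats these as different strings until they are normalised. The Normalize method used here is described fully in the next section; calling it with no arguments applies the default, composed canonical form:

string composed = "caf\u00E9";       // "café" using the precomposed é character
string decomposed = "cafe\u0301";    // "café" using e plus a combining acute accent

Console.WriteLine(composed == decomposed);                          // False
Console.WriteLine(composed.Normalize() == decomposed.Normalize());  // True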

There are two basic forms of Unicode normalisation, known as canonical and compatible. Canonical normalisation identifies characters at different code points that have the same appearance and meaning, and replaces them with a matching character at a known code point. If there are several versions of the letter "e" with an umlaut, they will all be converted to the same code. Compatible normalisation is a similar process that looks for characters with the same meaning only, even if they look different, and replaces them with a known character's code point.
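The "ﬁ" ligature (U+FB01) shows the difference between the two forms. It is compatibility-equivalent to the letter pair "fi" but has no canonical decomposition, so only a compatible normalisation changes it:

string ligature = "\uFB01";   // the single "ﬁ" ligature character

Console.WriteLine(ligature.Normalize(NormalizationForm.FormD));    // ﬁ (unchanged)
Console.WriteLine(ligature.Normalize(NormalizationForm.FormKD));   // fi (two letters)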

Both canonical and compatible normalisation have two variants. Characters can be fully composed or fully decomposed. When fully composed, a single character might be used to represent several elements. For example, "ë" may be considered as a single character containing the letter "e" and an umlaut diacritical mark. When using fully decomposed normalisation, each part of the character becomes a separate code point. In this case, the letter "ë" will produce two characters in the normalised string: the letter "e" and the combining diacritic mark. The characters are categorised according to their type; diacritics fall into the category of non-spacing marks.
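The effect of decomposition can be seen by comparing string lengths. Here "ë" starts as a single char value but becomes two after a fully decomposed normalisation, the second being the combining diaeresis (U+0308):

string single = "\u00EB";     // "ë" as one precomposed character
string parts = single.Normalize(NormalizationForm.FormD);

Console.WriteLine(single.Length);   // 1
Console.WriteLine(parts.Length);    // 2 - "e" followed by the combining diaeresis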

Removing Diacritics

You can use the above features of the Unicode standard to remove diacritics from strings prior to performing comparisons. Firstly, you need to perform normalisation on your strings, fully decomposing the characters. The string class includes a method for this purpose. Normalize has a parameter that receives the type of normalisation to be performed as a NormalizationForm enumeration value; this enumeration is found in the System.Text namespace. Four options are available:

Value    Canonical/Compatible    Composed/Decomposed
FormC    Canonical               Composed
FormD    Canonical               Decomposed
FormKC   Compatible              Composed
FormKD   Compatible              Decomposed

We can perform a fully decomposed, canonical normalisation as follows. Note how the diacritics are moved into separate characters from the letters in this sample:

string initial = "ÁÂÃÄÅÇÈÉàáâãäåèéêëìíîïòóôõ";
string normal = initial.Normalize(NormalizationForm.FormD);

foreach (char c in normal)
{
    Console.Write(c);
}

/* OUTPUT

A´A^A~A¨A°C¸E'E´a'a´a^a~a¨a°e'e´e^e¨i'i´i^i¨o'o´o^o~

*/

You can check the category of a character with the static GetUnicodeCategory method, which is part of the CharUnicodeInfo class. This method returns a value from the UnicodeCategory enumerated type. A diacritic character generates a NonSpacingMark result. The CharUnicodeInfo class is found in the System.Globalization namespace. The sample that follows also uses a LINQ standard query operator, which requires the System.Linq namespace:

using System.Globalization;
using System.Linq;
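For example, the combining acute accent (U+0301) is categorised as a non-spacing mark, whereas an ordinary letter is not:

Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('\u0301'));   // NonSpacingMark
Console.WriteLine(CharUnicodeInfo.GetUnicodeCategory('e'));        // LowercaseLetter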

To remove the diacritics from a string we can combine the above processes, normalising and decomposing before looping through the resultant characters and removing all non-spacing marks. You could use a for loop or a foreach loop but, for brevity, in this case let's use a LINQ standard query operator:

string initial = "ÁÂÃÄÅÇÈÉàáâãäåèéêëìíîïòóôõ";
string normal = initial.Normalize(NormalizationForm.FormD);

var withoutDiacritics = normal.Where(
    c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);

string final = new string(withoutDiacritics.ToArray());

Console.WriteLine(initial);
Console.WriteLine(final);

/* OUTPUT

ÁÂÃÄÅÇÈÉàáâãäåèéêëìíîïòóôõ
AAAAACEEaaaaaaeeeeiiiioooo

*/
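If you need to strip diacritics in several places, the process can be wrapped in a reusable helper. The extension method below is one possible way to package the technique from this article; the class and method names are arbitrary:

using System.Globalization;
using System.Text;

public static class StringExtensions
{
    // Returns a copy of the string with all non-spacing marks removed
    public static string RemoveDiacritics(this string value)
    {
        string normalised = value.Normalize(NormalizationForm.FormD);
        StringBuilder builder = new StringBuilder();

        foreach (char c in normalised)
        {
            if (CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            {
                builder.Append(c);
            }
        }

        return builder.ToString();
    }
}

With this in place, the earlier search example becomes a one-line call:

Console.WriteLine("Díaz".RemoveDiacritics());   // Diaz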
22 September 2013