BlackWasp™


Algorithms and Data Structures
.NET 2.0+

The Soundex Algorithm

Cultural differences and input errors can lead to words being spelled differently to a user's expectations. This makes it difficult to locate information quickly. The Soundex algorithm can alleviate this by assigning codes based upon the sound of words.

Adding Soundex Character Codes

The GetSoundex method calls several private methods that have yet to be defined. The first is the AddCharacter method, which encodes a letter as a Soundex character and appends it to the code. The first letter is copied to the Soundex string; subsequent letters are converted to digits first and added only if they are not duplicates of the previous digit.

private void AddCharacter(StringBuilder soundex, char ch)
{
    if (soundex.Length == 0)
        soundex.Append(ch);
    else
    {
        string code = GetSoundexDigit(ch);
        if (code != soundex[soundex.Length - 1].ToString())
            soundex.Append(code);
    }
}

Determining Soundex Digits

The GetSoundexDigit method encodes letters as digits. The letter is converted to a value between one and six according to the algorithm rules. The comparisons are case-sensitive, so the method expects an upper case letter. If the letter is not encodable, a full stop (period) character is used as a placeholder. The placeholders ensure that duplicates are not removed when separated by a vowel, H, W or Y.

private string GetSoundexDigit(char ch)
{
    string chString = ch.ToString();

    if ("BFPV".Contains(chString))
        return "1";
    else if ("CGJKQSXZ".Contains(chString))
        return "2";
    else if ("DT".Contains(chString))
        return "3";
    else if (ch == 'L')
        return "4";
    else if ("MN".Contains(chString))
        return "5";
    else if (ch == 'R')
        return "6";
    else
        return ".";
}
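To see how the placeholders interact with the duplicate check, the following sketch runs whole words through the two methods and prints the code before any placeholders are removed. It is a minimal illustration rather than part of the Soundex class: the methods are made static, and the PlaceholderDemo class and BuildRaw helper are hypothetical names that assume the input is already upper case and contains only the letters A to Z.

```csharp
using System;
using System.Text;

public static class PlaceholderDemo
{
    // Same logic as the article's AddCharacter, made static for this demo.
    public static void AddCharacter(StringBuilder soundex, char ch)
    {
        if (soundex.Length == 0)
            soundex.Append(ch);
        else
        {
            string code = GetSoundexDigit(ch);
            if (code != soundex[soundex.Length - 1].ToString())
                soundex.Append(code);
        }
    }

    // Same mapping as the article's GetSoundexDigit.
    public static string GetSoundexDigit(char ch)
    {
        string chString = ch.ToString();
        if ("BFPV".Contains(chString)) return "1";
        if ("CGJKQSXZ".Contains(chString)) return "2";
        if ("DT".Contains(chString)) return "3";
        if (ch == 'L') return "4";
        if ("MN".Contains(chString)) return "5";
        if (ch == 'R') return "6";
        return ".";
    }

    // Hypothetical helper: builds the code with placeholders still in place.
    // Assumes the word is upper case and contains only A-Z.
    public static string BuildRaw(string word)
    {
        StringBuilder soundex = new StringBuilder();
        foreach (char ch in word)
            AddCharacter(soundex, ch);
        return soundex.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(BuildRaw("JACKSON")); // Outputs "J.2.5"
        Console.WriteLine(BuildRaw("BANANA"));  // Outputs "B.5.5."
    }
}
```

In "JACKSON" the adjacent C, K and S all encode to 2 and collapse to a single digit. In "BANANA" the two N codes are separated by a placeholder, so both survive.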

Removing Placeholder Characters

The next method removes placeholder characters from the StringBuilder object, leaving only encoded characters. This is achieved with a call to the StringBuilder's Replace method:

private void RemovePlaceholders(StringBuilder soundex)
{
    soundex.Replace(".", "");
}

Setting the Soundex Length

The final code must be four characters long: the initial letter followed by three digits. The FixLength method pads the string with zeroes if it is too short or truncates it if it is too long:

private void FixLength(StringBuilder soundex)
{
    int length = soundex.Length;
    if (length < 4)
        soundex.Append(new string('0', 4 - length));
    else
        soundex.Length = 4;
}
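The two finishing steps can be tried in isolation. The sketch below applies them to hand-written intermediate codes; the FinishDemo class and Finish helper are hypothetical names used only for this illustration, and the methods are made static so that they can be called without the rest of the class.

```csharp
using System;
using System.Text;

public static class FinishDemo
{
    // Same logic as the article's RemovePlaceholders.
    public static void RemovePlaceholders(StringBuilder soundex)
    {
        soundex.Replace(".", "");
    }

    // Same logic as the article's FixLength.
    public static void FixLength(StringBuilder soundex)
    {
        int length = soundex.Length;
        if (length < 4)
            soundex.Append(new string('0', 4 - length));
        else
            soundex.Length = 4;
    }

    // Hypothetical helper: strips placeholders, then pads or truncates.
    public static string Finish(string raw)
    {
        StringBuilder soundex = new StringBuilder(raw);
        RemovePlaceholders(soundex);
        FixLength(soundex);
        return soundex.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Finish("S5.3."));   // Outputs "S530" (padded)
        Console.WriteLine(Finish("P1.23.6")); // Outputs "P123" (truncated)
    }
}
```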

Comparing Strings

The second public method of the Soundex class compares two strings to determine if they sound alike. In this case we use a simple algorithm. Firstly, the two strings are encoded using the Soundex algorithm. Next, the pairs of characters at each of the four positions are compared. The method returns the number of matching pairs. A result of four indicates the best possible match and zero the worst possible match. These values are useful when sorting a list of possible matches with the most likely appearing first.

public int Compare(string value1, string value2)
{
    int matches = 0;
    string soundex1 = GetSoundex(value1);
    string soundex2 = GetSoundex(value2);

    for (int i = 0; i < 4; i++)
        if (soundex1[i] == soundex2[i]) matches++;

    return matches;
}

Testing the Methods

To test the algorithms we can use the Main method of the console application. The following test encodes two strings and compares their Soundex codes. Try changing the strings to see the results.

string value1 = "Smith";
string value2 = "Smythe";

Soundex soundex = new Soundex();
Console.WriteLine(soundex.GetSoundex(value1));      // Outputs "S530"
Console.WriteLine(soundex.GetSoundex(value2));      // Outputs "S530"
Console.WriteLine(soundex.Compare(value1, value2)); // Outputs "4"
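For experimenting with other strings, including pairs that only partly match, the following self-contained sketch assembles the methods from this article into one class. The GetSoundex method is not shown in this section, so the version here is an assumption: it converts the input to upper case, ignores anything other than the letters A to Z, and then applies the steps described above. The class name SoundexSketch is also hypothetical.

```csharp
using System;
using System.Text;

public class SoundexSketch
{
    // ASSUMPTION: a stand-in for the article's GetSoundex, which is defined elsewhere.
    // Upper-cases the input and skips any character that is not A-Z.
    public string GetSoundex(string value)
    {
        StringBuilder soundex = new StringBuilder();
        foreach (char ch in value.ToUpperInvariant())
            if (ch >= 'A' && ch <= 'Z')
                AddCharacter(soundex, ch);
        RemovePlaceholders(soundex);
        FixLength(soundex);
        return soundex.ToString();
    }

    // Counts the positions at which the two four-character codes agree.
    public int Compare(string value1, string value2)
    {
        int matches = 0;
        string soundex1 = GetSoundex(value1);
        string soundex2 = GetSoundex(value2);
        for (int i = 0; i < 4; i++)
            if (soundex1[i] == soundex2[i]) matches++;
        return matches;
    }

    private void AddCharacter(StringBuilder soundex, char ch)
    {
        if (soundex.Length == 0)
            soundex.Append(ch);
        else
        {
            string code = GetSoundexDigit(ch);
            if (code != soundex[soundex.Length - 1].ToString())
                soundex.Append(code);
        }
    }

    private string GetSoundexDigit(char ch)
    {
        string chString = ch.ToString();
        if ("BFPV".Contains(chString)) return "1";
        if ("CGJKQSXZ".Contains(chString)) return "2";
        if ("DT".Contains(chString)) return "3";
        if (ch == 'L') return "4";
        if ("MN".Contains(chString)) return "5";
        if (ch == 'R') return "6";
        return ".";
    }

    private void RemovePlaceholders(StringBuilder soundex)
    {
        soundex.Replace(".", "");
    }

    private void FixLength(StringBuilder soundex)
    {
        int length = soundex.Length;
        if (length < 4)
            soundex.Append(new string('0', 4 - length));
        else
            soundex.Length = 4;
    }

    public static void Main()
    {
        SoundexSketch soundex = new SoundexSketch();
        Console.WriteLine(soundex.GetSoundex("Robert"));        // Outputs "R163"
        Console.WriteLine(soundex.GetSoundex("Rupert"));        // Outputs "R163"
        Console.WriteLine(soundex.Compare("Robert", "Rupert")); // Outputs "4"
        Console.WriteLine(soundex.GetSoundex("Jones"));         // Outputs "J520"
        Console.WriteLine(soundex.Compare("Smith", "Jones"));   // Outputs "2"
    }
}
```

"Smith" (S530) and "Jones" (J520) agree only at the second and fourth positions, so the comparison returns two.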

Variations

There are variations on the Soundex algorithm, which makes it difficult to compare Soundex codes generated by different systems. One variation removes duplication at the start of the word: if the letters immediately following the initial letter encode to the same digit as the initial letter itself, they are dropped. We can add this modification by changing the AddCharacter method as shown below. Note the additional branch for a one-character Soundex code, which compares the new digit with the digit that the initial letter would produce and adds it only if the two differ.

private void AddCharacter(StringBuilder soundex, char ch)
{
    if (soundex.Length == 0)
        soundex.Append(ch);
    else if (soundex.Length == 1)
    {
        string code = GetSoundexDigit(ch);
        string initialCode = GetSoundexDigit(soundex[0]);
        if (code != initialCode)
            soundex.Append(code);
    }
    else
    {
        string code = GetSoundexDigit(ch);
        if (code != soundex[soundex.Length - 1].ToString())
            soundex.Append(code);
    }
}
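The effect of this variation is easiest to see with a name whose first two letters share a digit. The sketch below encodes the same word with and without the initial-letter check. It is a condensed illustration, not the article's class: the InitialDuplicateDemo class, the Encode helper and the checkInitial flag are hypothetical, the two versions of AddCharacter are merged into one method, and the input is assumed to be upper case A to Z.

```csharp
using System;
using System.Text;

public static class InitialDuplicateDemo
{
    public static string GetSoundexDigit(char ch)
    {
        string chString = ch.ToString();
        if ("BFPV".Contains(chString)) return "1";
        if ("CGJKQSXZ".Contains(chString)) return "2";
        if ("DT".Contains(chString)) return "3";
        if (ch == 'L') return "4";
        if ("MN".Contains(chString)) return "5";
        if (ch == 'R') return "6";
        return ".";
    }

    // Condensed form of both AddCharacter versions. When checkInitial is true
    // and only the initial letter is present, the new digit is compared with
    // the digit of that letter instead of with the letter itself.
    public static void AddCharacter(StringBuilder soundex, char ch, bool checkInitial)
    {
        if (soundex.Length == 0)
        {
            soundex.Append(ch);
            return;
        }

        string code = GetSoundexDigit(ch);
        string previous = (checkInitial && soundex.Length == 1)
            ? GetSoundexDigit(soundex[0])
            : soundex[soundex.Length - 1].ToString();
        if (code != previous)
            soundex.Append(code);
    }

    // Hypothetical helper: encodes an upper case A-Z word to four characters.
    public static string Encode(string word, bool checkInitial)
    {
        StringBuilder soundex = new StringBuilder();
        foreach (char ch in word)
            AddCharacter(soundex, ch, checkInitial);

        soundex.Replace(".", "");
        if (soundex.Length < 4)
            soundex.Append('0', 4 - soundex.Length);
        else
            soundex.Length = 4;
        return soundex.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Encode("PFISTER", false)); // Outputs "P123"
        Console.WriteLine(Encode("PFISTER", true));  // Outputs "P236"
        Console.WriteLine(Encode("LLOYD", false));   // Outputs "L430"
        Console.WriteLine(Encode("LLOYD", true));    // Outputs "L300"
    }
}
```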

Another common variation is to treat the letters H, W and Y differently to vowels, ignoring them completely and removing duplicate codes that are separated by one of the three letters. You can use this variation by modifying the GetSoundexDigit method as follows:

private string GetSoundexDigit(char ch)
{
    string chString = ch.ToString();

    if ("BFPV".Contains(chString))
        return "1";
    else if ("CGJKQSXZ".Contains(chString))
        return "2";
    else if ("DT".Contains(chString))
        return "3";
    else if (ch == 'L')
        return "4";
    else if ("MN".Contains(chString))
        return "5";
    else if (ch == 'R')
        return "6";
    else if ("HWY".Contains(chString))
        return "";
    else
        return ".";
}
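A name such as "Ashcraft", where an H separates two letters with the same digit, shows the difference between the two rules. The following sketch encodes it both ways. As before it is only an illustration: the HwyDemo class, the Encode helper and the ignoreHwy flag are hypothetical, the original AddCharacter logic is used unchanged, and the input is assumed to be upper case A to Z.

```csharp
using System;
using System.Text;

public static class HwyDemo
{
    // When ignoreHwy is true, H, W and Y yield an empty string, so nothing is
    // appended and the previous digit remains the one compared against.
    public static string GetSoundexDigit(char ch, bool ignoreHwy)
    {
        string chString = ch.ToString();
        if ("BFPV".Contains(chString)) return "1";
        if ("CGJKQSXZ".Contains(chString)) return "2";
        if ("DT".Contains(chString)) return "3";
        if (ch == 'L') return "4";
        if ("MN".Contains(chString)) return "5";
        if (ch == 'R') return "6";
        if (ignoreHwy && "HWY".Contains(chString)) return "";
        return ".";
    }

    // Hypothetical helper: encodes an upper case A-Z word to four characters.
    public static string Encode(string word, bool ignoreHwy)
    {
        StringBuilder soundex = new StringBuilder();
        foreach (char ch in word)
        {
            if (soundex.Length == 0)
            {
                soundex.Append(ch);
                continue;
            }

            string code = GetSoundexDigit(ch, ignoreHwy);
            if (code != soundex[soundex.Length - 1].ToString())
                soundex.Append(code); // appending "" leaves the builder unchanged
        }

        soundex.Replace(".", "");
        if (soundex.Length < 4)
            soundex.Append('0', 4 - soundex.Length);
        else
            soundex.Length = 4;
        return soundex.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Encode("ASHCRAFT", false)); // Outputs "A226"
        Console.WriteLine(Encode("ASHCRAFT", true));  // Outputs "A261"
        Console.WriteLine(Encode("BANANA", true));    // Outputs "B550"
    }
}
```

With the variation, the S and C in "ASHCRAFT" are treated as adjacent and collapse to a single 2. Duplicates separated by a vowel, as in "BANANA", are still kept.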
12 February 2010