Simil: An algorithm to look for similar strings

by Tom van Stiphout10. April 2011  

Introduction

Are you a perfect speller? Is everyone in your company? How about your business partners? Misspellings are a fact of life. There are also legitimate differences in spelling: what Americans call rumors, the British call rumours. Steven A. Ballmer and Steve Ballmer are two different but accurate forms of that man’s name. Your database may contain a lot of legacy values from the days before better validation at the point of data entry.

Overall, chances are your database already contains imperfect textual data, which makes it hard to search. Additionally, the user may not know exactly what to look for. When looking for a number or a date, we could search for a range, but text is more unstructured, so database engines such as SQL Server include a range of tools to find text, including the following:

  • EQUALS (=) and LIKE
  • SOUNDEX and DIFFERENCE
  • CONTAINS and FREETEXT
  • Simil
If you are working with an Access back-end database, only the first and last option are available. Soundex could be made available with a VBA module.

Equals and LIKE search for equality with or without wildcards. SOUNDEX uses a phonetic algorithm based on the sound of the consonants in a string. CONTAINS is optimized for finding inflectional forms and synonyms of strings. Simil is an algorithm that compares two strings, and based on the longest common substrings, computes a similarity between 0 (completely different) and 1 (identical). This is sometimes called fuzzy string matching. Simil isn’t available by default. Later in this chapter we’ll discuss how to install it.

Simil

T-SQL allows us to perform a wide range of text searches. Still, a lot remains to be desired, especially with regard to misspellings. If you want to find a set of records even if they have misspellings, or want to prevent misspellings, you need to perform fuzzy string comparisons, and Simil is one algorithm suited for that task.

One use for Simil is in data cleanup. In one example, a company had a table with organic chemistry compounds, and their names were sometimes spelled differently. The application presents the user with the current record and similar records. The user can decide which records are duplicates, and choose the best one. One button click later, all child records are pointed to the chosen record, and the bad records are deleted. Then the user moves to the next record.

Another typical use for Simil is in preventing bad data from entering the database in the first place. Our company has a Sales application with a Companies table. When a salesperson is creating or importing a new company, the application uses Simil to scan for similar company names. If it finds any records, it’ll show a dialog box asking the user if the new company is one of those, or indeed a new company, as shown in figure 1.

similar companies form

Other uses include educational software with open-ended questions. One tantalizing option the original authors mention is to combine Simil with a compiler, which could then auto-correct common mistakes.

Let’s look at Simil in more detail, and learn how we can take advantage of it. In 1988, Dr. Dobb’s Journal published the Ratcliff/Obershelp algorithm for pattern recognition (Ratcliff and Metzener, “Pattern Matching: The Gestalt Approach”). This algorithm compares two strings and returns a similarity between 0 (completely different) and 1 (identical). Ratcliff and Obershelp wrote the original version in assembly language for the 8086 processor. In 1999, Steve Grubb published his interpretation in the C language. This is the version I used as a starting point for the DLL and .NET implementations I’m presenting here.

The purpose of Simil is to calculate a similarity between two strings.

Algorithm

The Simil algorithm looks for the longest common substring, and then looks at the right and left remainders for the longest common substrings, and so on recursively until no more are found. It then returns the similarity as a value between 0 and 1, by dividing the sum of the lengths of the substrings by the lengths of the strings themselves.

Table 1 shows an example for two spellings of the word Pennsylvania. The algorithm finds the largest common substrings lvan, and then repeats with the remaining strings until there are no further common substrings.

Word 1Word 2Common substringLength
PennsylvaniaPencilvaneyalvan8
Pennsy    iaPenci    eyaPen6
   nsy    ia   ci    ey a2
   nsy    i    ci    ey (none)0
Subtotal  16
Length of original strings  24
Simil = 16/24  0.67

Simil is case sensitive. If you want to ignore case, convert both strings to uppercase or lowercase before calling Simil.

At its core, Simil is a longest common substring or LCS algorithm, and its performance can be expected to be on par with that class of algorithms. Anecdotally, we know that using Simil to test a candidate company name against 20,000 company names takes less than a second.

Simil has good performance and is easy to understand. It also has several weaknesses, including the following:

  • The result value is abstract. Therefore it’ll take some trial and error to find a good threshold value above which you’d consider two strings similar enough to take action. For data such as company names, I recommend a starting Simil value of about 0.75. For the organic chemistry names, we found that 0.9 gave us better results.
  • It’s insensitive for very small strings. For example, Adams and Ramos have three out of five characters in common, so the Simil value is 0.6. Most people wouldn’t call those names similar.
  • It treats every letter the same, without regard for vowels or consonants, or for letters that often occur together, or for the location in the string, or any other criteria. Some other algorithms do; for example, in the English language the letters Q and U nearly always occur together and in that order, so much so that they could almost be considered a single letter. In a more comprehensive algorithm, such occurrences could be given special consideration. SOUNDEX is another algorithm that does take into account that some consonants are almost the same (for example, d and t).
  • Simil can’t be precalculated, always requires a table scan, and can’t take advantage of indexes. This may be a problem for large datasets.

Implementation in .NET

Several years ago I used the C version from Steve Grubb to create a classic Windows DLL that was called from the business layer of an application, and it has served me well. This DLL is available in the download package. We'll discuss it below in the section on Access/VBA programming.

In a search for higher levels of performance, I rewrote the code for .NET in two ways. The first is a straight port from C to VB.NET; the second is a pure .NET interpretation. Why two ways? When a new development platform comes out, some developers stay with what they know and mold the platform to their way of programming, while others go with the flow and change their way of programming to what the platform has to offer. I was curious to find out which approach would yield the best performance.

The straight port is available in the Simil method of the clsSimil class in SimilCLR.dll. The pure .NET version is available in the Simil method of the RatcliffObershelp class in SimilCLR.dll. This version is the one we’re using in the next section. To me, it was gratifying to find out that the pure .NET version performed 30 percent better than the straight port.

Installation in SQL Server

SimilCLR.dll is a .NET assembly. An assembly is a unit of execution of a .NET application. SQL Server 2005 introduced the ability to run .NET assemblies in the SQL Server process space. Running inside of the SQL Server process offers performance benefits over the previous method of calling an extended stored procedure. If you’re using an older version of SQL Server, I suggest using the classic DLL from your client or middle-tier code. All code modules discussed here can be downloaded from the link at the bottom of this article.

Because they can pack tremendous power, by default SQL Server doesn’t allow .NET assemblies to run. To enable this capability, use the following:

EXEC sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO

Please note that this is a server-wide setting.

Next copy SimilCLR.dll to a folder of your choice on the database server machine. To register an assembly with SQL Server, use the following:

CREATE ASSEMBLY asmSimil
AUTHORIZATION dbo
FROM N'C:\Windows\SimilCLR.dll' --Enter your path.
WITH PERMISSION_SET = SAFE;
GO

Once the assembly is registered, we need to make its methods accessible to T-SQL. This code creates a scalar function that takes the two strings to be compared, calls the Simil method in the assembly, and returns the Simil value for them:

CREATE FUNCTION dbo.fnSimil(@s1 nvarchar(max), @s2 nvarchar(max))
RETURNS float WITH EXECUTE AS CALLER
AS
EXTERNAL NAME asmSimil.[SimilCLR.RatcliffObershelp].Simil

In the next section, we’ll use this function to run the Simil algorithm.

Usage

The simplest use of this function, as shown in listing 1, is a procedure that takes a pair of strings and returns the result through the output parameter.

Listing 1 Calling the fnSimil() function from a stored procedure

CREATE PROCEDURE dbo.spSimil
  @str1 nvarchar(max),
  @str2 nvarchar(max),
  @dblSimil float output
AS
SET NOCOUNT ON
SELECT @dblSimil = dbo.fnSimil(@str1, @str2)
RETURN

You can call this procedure like this:

DECLARE @dblSimil float
EXEC dbo.spSimil 'some string', 'some other string', @dblSimil OUTPUT
SELECT @dblSimil --0.786

A more powerful use of the function, shown in listing 2, is where you search an entire table for similar strings, only returning those more similar than some threshold value. This procedure returns all Person records where the Person’s name is more similar to the given name than a certain threshold.

Listing 2 Using the fnSimil() function to search an entire table

CREATE PROCEDURE dbo.spSimil_FirstNameLastName
  @str1 nvarchar(max),
  @threshold float
AS
SET NOCOUNT ON
SELECT *
FROM (SELECT dbo.fnSimil(@str1, Person.Person.FirstName + N' ' +
➥Person.Person.LastName) AS Simil, * FROM Person.Person) AS T
WHERE T.Simil >= @threshold
ORDER BY T.Simil DESC;

This procedure can be called like this:

EXEC dbo.spSimil_FirstNameLastName N'John Adams', 0.75

A query like this can be used to ensure that only genuinely new persons are added to the database, and not simple misspellings.

Using Simil from a pure Access application

What if you have an Access back-end rather than SQL Server? Fortunately there is a solution for you as well. SimilDll.dll is in the download package and it can be used from any VBA application. There are a couple of steps to making it work.

Since SimilDll.dll is a classic dll, you need to declare it in a standard module:

Public Declare Function fnSimil Lib "simildll" Alias "_simil@8"
➥(ByVal strOne As String, ByVal strTwo As String) As Double

There is no need to set a reference. The DLL should be in the Path, typically in the windows\system32 folder (if on 32-bit Windows) or windows\syswow64 folder (if on 64-bit Windows). You can call this function just like any other:

MsgBox "The simil value between Pennsylvania and Pencilvaneya is "
➥& fnSimil("Pennsylvania", "Pencilvaneya")

The sample application in the download package has a Companies form. When you add a new record and attempt to save it, it will open modal form SimilarCompanies from the Form_BeforeUpdate event, passing in the CompanyName you were trying to add via the OpenArgs argument. SimilarCompanies then uses the simil function to find all companies similar to the new one. It does this by setting its RecordSource property to "select * from tblCompanies where CompanyID in (...) order by simil desc".

The in-clause is given by whatever is returned by the GetSimilarCompanies function. It is doing the heavy lifting by opening a recordset on the Companies table, looping over the records and calling simil to calculate the similarity between the CompanyName in the recordset and the CompanyName of the new record. If the value is greater than some threshold value (approx. 0.75 will do nicely), the CompanyID is added to the in-clause.

A user reported problems using simil from an Access query: select fnSimil([CompanyName], "Candidate Company") from tblCompanies. This query would only return 1 or 0 for simil. It turns out Access queries convert all text to Unicode, the super ascii table with 2 bytes per character. This caused simil to only consider the first character of the strings. The latest version of simildll (1.0.0.2) has a new function to deal with this:

Public Declare Function fnSimil_w Lib "simildll" Alias "_simil_w@8"
➥(ByVal strOne As String, ByVal strTwo As String) As Double

The "w" in the name stands for "wide char", the official name of a unicode character in C. Now we can write the query: select fnSimil_w([CompanyName], "Candidate Company") from tblCompanies

Some finer points

There are a few details worth noting about the sample application. The first one is that of using a modal dialog. We set its Modal property to True, and open it with DoCmd.OpenForm WindowMode:=acDialog. No more code will execute in this procedure until the modal form closes. However there is a loophole: if the form becomes invisible, it also "falls out of the modal loop" and the next line of code in the calling procedure will execute. When we use a button, or close the form with the X button, the calling form wants to know which option was chosen. If we allowed the form to close that information would be lost unless we first saved it in a global variable. That would work, but is not very elegant and self-contained. In this sample application we chose to keep the form running invisibly, and offer a public property DialogResult for the caller to find out what the status is. So this explains why we wrote:

DoCmd.OpenForm "SimilarCompanies", , , , , acDialog, Me.CompanyName
strDialogResult = Form_SimilarCompanies.DialogResult

In our sample application, the DialogResult can be either "New", "Cancel", or a numeric CompanyID. If it is "New", we do nothing, which causes the Form_BeforeUpdate event to complete and the record to be saved. If it is "Cancel", we set the Cancel argument of Form_BeforeUpdate to True, which stops the record from being saved. If it is a number, the user must have clicked the "Use Existing" button, and we jump to that record using the Bookmark technique.

What if SimilarCompanies found zero records that are similar to the new CompanyName? It is not very useful to show an empty form, so we want to close it and signal that the record can be saved. I would like to write:

If Me.RecordsetClone.RecordCount = 0 Then
  DialogResult = "New"
  Me.Visible = False

Unfortunately, that does not work. Setting the Visible property is being ignored in the Form_Open event. This is where we use another trick: postpone setting this property until a later time. We could have done this several ways; I chose to start a timer, and in the Timer event set Visible to False. The benefit of the timer event is that it is a low-priority event and occurs after all other events (for example Form_Current, Form_Activate) have run and things are settling down.

One last item to cover is this: with all the effort we are putting into making SimilarCompanies run invisibly, how do we actually close it? Without special code it would see any attempt to close the form as a reason to set DialogResult to "Cancel" and Cancel the Form_Unload event. We need this code in place because this is how we signal Cancel when the user closes the form with the X button. The answer is for the calling procedure to set the DialogResult property to a special value which signals "OK to close".

Summary

In this blog post we discussed the Simil algorithm and showed several ways to use it in your applications. The free download includes all DLLs discussed here, as well as the Access 2007 sample application.

More source code and a more comprehensive version of this article was published in this book: SQL Server MVP Deep Dives, by Manning Publications.

Crystal Long and myself produced a video about Simil here.

You can download a zip file with Simil code and an MS-Access sample application here.

About the author

Tom van Stiphout bio goes here.