Introduction
Are you a perfect speller? Is everyone in your company? How about your business
partners? Misspellings are a fact of life. There are also legitimate differences
in spelling: what Americans call rumors, the British call rumours.
Steven A. Ballmer and Steve Ballmer are two different but
accurate forms of that man’s name. Your database may contain a lot of legacy
values from the days before better validation at the point of data entry.
Overall, chances are your database already contains imperfect textual data,
which makes it hard to search. Additionally, the user may not know exactly what to
look for. When looking for a number or a date, we could search for a range, but
text is more unstructured, so database engines such as SQL Server include a range
of tools to find text, including the following:
- EQUALS (=) and LIKE
- SOUNDEX and DIFFERENCE
- CONTAINS and FREETEXT
- Simil
If you are working with an Access back-end database, only the first and last option are available. Soundex could be made available with a VBA module.
Equals and LIKE search for equality with or without wildcards. SOUNDEX uses a phonetic
algorithm based on the sound of the consonants in a string. CONTAINS is optimized
for finding inflectional forms and synonyms of strings.
Simil is an algorithm that compares two strings, and based on the longest common
substrings, computes a similarity between 0 (completely different) and 1
(identical). This is sometimes called fuzzy string matching. Simil isn’t available by
default. Later in this chapter we’ll discuss how to install it.
Simil
T-SQL allows us to perform a wide range of text searches. Still, a lot
remains to be desired, especially with regard to misspellings. If you want to find a set
of records even if they have misspellings, or want to prevent misspellings, you need to
perform fuzzy string comparisons, and Simil is one algorithm suited for that task.
One use for Simil is in data cleanup. In one example, a company had a table with
organic chemistry compounds, and their names were sometimes spelled differently.
The application presents the user with the current record and similar records. The
user can decide which records are duplicates, and choose the best one. One button
click later, all child records are pointed to the chosen record, and the bad records are
deleted. Then the user moves to the next record.
Another typical use for Simil is in preventing bad data from entering the database
in the first place. Our company has a Sales application with a Companies table. When
a salesperson is creating or importing a new company, the application uses Simil to
scan for similar company names. If it finds any records, it’ll show a dialog box asking
the user if the new company is one of those, or indeed a new company, as shown in
figure 1.
Other uses include educational software with open-ended questions. One tantalizing
option the original authors mention is to combine Simil with a compiler, which
could then auto-correct common mistakes.
Let’s look at Simil in more detail, and learn how we can take advantage of it.
In 1988, Dr. Dobb’s Journal published the Ratcliff/Obershelp algorithm for pattern
recognition (Ratcliff and Metzener, “Pattern Matching: The Gestalt Approach”). This algorithm compares two strings and returns a
similarity between 0 (completely different) and 1 (identical). Ratcliff and Obershelp
wrote the original version in assembly language for the 8086 processor. In 1999, Steve
Grubb published his interpretation in the C language.
This is the version I used as a starting point for the DLL and .NET implementations I’m presenting here.
The purpose of Simil is to calculate a similarity between two strings.
Algorithm
The Simil algorithm looks for the longest common substring, and then looks at the
right and left remainders for the longest common substrings, and so on recursively
until no more are found. It then returns the similarity as a value between 0 and 1,
by dividing the sum of the lengths of the substrings by the lengths of the strings
themselves.
Table 1 shows an example for two spellings of the word Pennsylvania. The algorithm
finds the largest common substrings lvan, and then repeats with the remaining
strings until there are no further common substrings.
Word 1 | Word 2 | Common substring | Length |
Pennsylvania | Pencilvaneya | lvan | 8 |
Pennsy ia | Penci eya | Pen | 6 |
nsy ia | ci ey | a | 2 |
nsy i | ci ey | (none) | 0 |
Subtotal | | | 16 |
Length of original strings | | | 24 |
Simil = 16/24 | | | 0.67 |
Simil is case sensitive. If you want to ignore case, convert both strings to uppercase or
lowercase before calling Simil.
At its core, Simil is a longest common substring or LCS algorithm, and its performance
can be expected to be on par with that class of algorithms. Anecdotally, we know that
using Simil to test a candidate company name against 20,000 company names takes
less than a second.
Simil has good performance and is easy to understand. It also has several weaknesses,
including the following:
-
The result value is abstract. Therefore it’ll take some trial and error to find a
good threshold value above which you’d consider two strings similar enough to
take action. For data such as company names, I recommend a starting Simil
value of about 0.75. For the organic chemistry names, we found that 0.9 gave us
better results.
-
It’s insensitive for very small strings. For example, Adams and Ramos have three
out of five characters in common, so the Simil value is 0.6. Most people
wouldn’t call those names similar.
-
It treats every letter the same, without regard for vowels or consonants, or for
letters that often occur together, or for the location in the string, or any other
criteria. Some other algorithms do; for example, in the English language the
letters Q and U nearly always occur together and in that order, so much so that
they could almost be considered a single letter. In a more comprehensive algorithm,
such occurrences could be given special consideration. SOUNDEX is
another algorithm that does take into account that some consonants are almost
the same (for example, d and t).
-
Simil can’t be precalculated, always requires a table scan, and can’t take advantage
of indexes. This may be a problem for large datasets.
Implementation in .NET
Several years ago I used the C version from Steve Grubb to create a classic Windows
DLL that was called from the business layer of an application, and it has served me
well. This DLL is available in the download package. We'll discuss it below in the section on Access/VBA programming.
In a search for higher levels of performance, I rewrote the code for .NET in two
ways. The first is a straight port from C to VB.NET; the second is a pure .NET interpretation.
Why two ways? When a new development platform comes out, some developers
stay with what they know and mold the platform to their way of programming, while
others go with the flow and change their way of programming to what the platform has
to offer. I was curious to find out which approach would yield the best performance.
The straight port is available in the Simil method of the clsSimil class in SimilCLR.dll.
The pure .NET version is available in the Simil method of the RatcliffObershelp
class in SimilCLR.dll. This version is the one we’re using in the next section.
To me, it was gratifying to find out that the pure .NET version performed 30 percent
better than the straight port.
Installation in SQL Server
SimilCLR.dll is a .NET assembly. An assembly is a unit of execution of a .NET application.
SQL Server 2005 introduced the ability to run .NET assemblies in the SQL Server
process space. Running inside of the SQL Server process offers performance benefits
over the previous method of calling an extended stored procedure. If you’re using an
older version of SQL Server, I suggest using the classic DLL from your client or middle-tier
code. All code modules discussed here can be downloaded from the link at the bottom of this article.
Because they can pack tremendous power, by default SQL Server doesn’t allow
.NET assemblies to run. To enable this capability, use the following:
EXEC sp_configure 'clr enabled', 1
GO
RECONFIGURE
GO
Please note that this is a server-wide setting.
Next copy SimilCLR.dll to a folder of your choice on the database server machine.
To register an assembly with SQL Server, use the following:
CREATE ASSEMBLY asmSimil
AUTHORIZATION dbo
FROM N'C:\Windows\SimilCLR.dll' --Enter your path.
WITH PERMISSION_SET = SAFE;
GO
Once the assembly is registered, we need to make its methods accessible to T-SQL. This
code creates a scalar function that takes the two strings to be compared, calls the
Simil method in the assembly, and returns the Simil value for them:
CREATE FUNCTION dbo.fnSimil(@s1 nvarchar(max), @s2 nvarchar(max))
RETURNS float WITH EXECUTE AS CALLER
AS
EXTERNAL NAME asmSimil.[SimilCLR.RatcliffObershelp].Simil
In the next section, we’ll use this function to run the Simil algorithm.
Usage
The simplest use of this function, as shown in listing 1, is a procedure that takes a pair
of strings and returns the result through the output parameter.
Listing 1 Calling the fnSimil() function from a stored procedure
CREATE PROCEDURE dbo.spSimil
@str1 nvarchar(max),
@str2 nvarchar(max),
@dblSimil float output
AS
SET NOCOUNT ON
SELECT @dblSimil = dbo.fnSimil(@str1, @str2)
RETURN
You can call this procedure like this:
DECLARE @dblSimil float
EXEC dbo.spSimil 'some string', 'some other string', @dblSimil OUTPUT
SELECT @dblSimil --0.786
A more powerful use of the function, shown in listing 2, is where you search an entire
table for similar strings, only returning those more similar than some threshold value.
This procedure returns all Person records where the Person’s name is more similar to
the given name than a certain threshold.
Listing 2 Using the fnSimil() function to search an entire table
CREATE PROCEDURE dbo.spSimil_FirstNameLastName
@str1 nvarchar(max),
@threshold float
AS
SET NOCOUNT ON
SELECT *
FROM (SELECT dbo.fnSimil(@str1, Person.Person.FirstName + N' ' +
➥Person.Person.LastName) AS Simil, * FROM Person.Person) AS T
WHERE T.Simil >= @threshold
ORDER BY T.Simil DESC;
This procedure can be called like this:
EXEC dbo.spSimil_FirstNameLastName N'John Adams', 0.75
A query like this can be used to ensure that only genuinely new persons are added to
the database, and not simple misspellings.
Using Simil from a pure Access application
What if you have an Access back-end rather than SQL Server? Fortunately there is
a solution for you as well. SimilDll.dll is in the download package and it can be
used from any VBA application. There are a couple of steps to making it work.
Since SimilDll.dll is a classic dll, you need to declare it in a standard module:
Public Declare Function fnSimil Lib "simildll" Alias "_simil@8"
➥(ByVal strOne As String, ByVal strTwo As String) As Double
There is no need to set a reference. The DLL should be in the Path,
typically in the windows\system32 folder (if on 32-bit Windows) or
windows\syswow64 folder (if on 64-bit Windows). You can call this function just like
any other:
MsgBox "The simil value between Pennsylvania and Pencilvaneya is "
➥& fnSimil("Pennsylvania", "Pencilvaneya")
The sample application in the download package has a Companies form. When you add a new record
and attempt to save it, it will open modal form SimilarCompanies from the Form_BeforeUpdate event,
passing in the CompanyName you were trying to add via the OpenArgs argument. SimilarCompanies
then uses the simil function to find all companies similar to the new one. It does this by setting
its RecordSource property to "select * from tblCompanies where CompanyID in (...) order by simil desc".
The in-clause is given by whatever is returned by the GetSimilarCompanies function. It is doing
the heavy lifting by opening a recordset on the Companies table, looping over the records and calling
simil to calculate the similarity between the CompanyName in the recordset and the CompanyName of
the new record. If the value is greater than some threshold value (approx. 0.75 will
do nicely), the CompanyID
is added to the in-clause.
A user reported problems using simil from an Access query: select
fnSimil([CompanyName], "Candidate Company") from tblCompanies. This
query would only return 1 or 0 for simil. It turns out Access queries
convert all text to Unicode, the super ascii table with 2 bytes per
character. This caused simil to only consider the first character of the
strings. The latest version of simildll (1.0.0.2) has a new function to deal
with this:
Public Declare Function fnSimil_w Lib "simildll" Alias "_simil_w@8"
➥(ByVal strOne As String, ByVal strTwo As String) As Double
The "w" in the name stands for "wide char", the official name of a unicode
character in C. Now we can write the query: select fnSimil_w([CompanyName],
"Candidate Company") from tblCompanies
Some finer points
There are a few details worth noting about the sample application. The first one is that of using
a modal dialog. We set its Modal property to True, and open it with
DoCmd.OpenForm WindowMode:=acDialog. No more code will execute
in this procedure until the modal form closes. However there is a loophole: if the form becomes invisible,
it also "falls out of the modal loop" and the next line of code in the calling procedure will execute.
When we use a button, or close the form with the X button, the calling form wants to know which option
was chosen. If we allowed the form to close that information would be lost unless we first saved it in a
global variable. That would work, but is not very elegant and self-contained. In this sample application
we chose to keep the form running invisibly, and offer a public property DialogResult for the caller to
find out what the status is. So this explains why we wrote:
DoCmd.OpenForm "SimilarCompanies", , , , , acDialog, Me.CompanyName
strDialogResult = Form_SimilarCompanies.DialogResult
In our sample application, the DialogResult can be either "New", "Cancel", or a numeric CompanyID.
If it is "New", we do nothing, which causes the Form_BeforeUpdate event to complete and the record
to be saved. If it is "Cancel", we set the Cancel argument of Form_BeforeUpdate to True, which stops
the record from being saved. If it is a number, the user must have clicked the "Use Existing" button,
and we jump to that record using the Bookmark technique.
What if SimilarCompanies found zero records that are similar to the new CompanyName?
It is not very useful to show an empty form, so we want to close it and signal
that the record can be saved. I would like to write:
If Me.RecordsetClone.RecordCount = 0 Then
DialogResult = "New"
Me.Visible = False
Unfortunately, that does not work. Setting the Visible property is being ignored in the Form_Open event.
This is where we use another trick: postpone setting this property until a later time. We could have done
this several ways; I chose to start a timer, and in the Timer event set Visible to False.
The benefit of the timer event is that it is a low-priority event and occurs after all other events
(for example Form_Current, Form_Activate) have run and things are settling down.
One last item to cover is this: with all the effort we are putting into making SimilarCompanies run
invisibly, how do we actually close it? Without special code it would see any attempt to close the form
as a reason to set DialogResult to "Cancel" and Cancel the Form_Unload event. We need this code in place
because this is how we signal Cancel when the user closes the form with the X button. The answer is for
the calling procedure to set the DialogResult property to a special value which signals "OK to close".
Summary
In this blog post we discussed the Simil algorithm and showed several ways to use it in your applications.
The free download includes all DLLs discussed here, as well as the Access 2007 sample application.
More source code and a more comprehensive version of this article was published in this book:
SQL Server MVP Deep Dives, by Manning Publications.
Crystal Long and myself produced a video about Simil here.
You can download a zip file with Simil code and an MS-Access sample application here.