Cookbook¶
Matchtools is a package written to streamline data matching and integration processes. This document contains an overview of the package’s functionalities. It is divided into three parts:
The first part presents how to use the package to perform some specific data operations. Integrating data from different sources often requires similar set of normalising steps - standardising numbers, removing linguistic or conventional differences etc. Matchtools package provides functionalities to automate that.
The second part introduces the matchtools methodology of finding matching records in different sets. It includes two examples of how to use the package to compare single records - e.g. checking whether two strings or lists can be considered as equal. It also shows how to set up different types of tolerance.
The third part shows how to use matchtools to find matching records within nested lists.
>>> from matchtools import *
1. Data manipulation¶
Working with roman numerals
Convert roman numerals to numbers:
>>> caesar = 'Pro multitudine autem hominum et pro gloria belli atque fortitudinis angustos se fines habere arbitrabantur, qui in longitudinem milia passuum CCXL, in latitudinem CLXXX patebant.'
>>> MatchBlock.roman_to_integers(caesar)
'Pro multitudine autem hominum et pro gloria belli atque fortitudinis angustos se fines habere arbitrabantur, qui in longitudinem milia passuum 240, in latitudinem 180 patebant.'
Convert numbers to roman numerals:
>>> caesar = 'They thought, that considering the extent of their population, and their renown for warfare and bravery, they had but narrow limits, although they extended in length 240, and in breadth 180 [Roman] miles.'
>>> MatchBlock.integers_to_roman(caesar)
'They thought, that considering the extent of their population, and their renown for warfare and bravery, they had but narrow limits, although they extended in length CCXL, and in breadth CLXXX [Roman] miles.'
Working with zeros
One of the most common data manipulation tasks when working on data integration. It is often the case that ‘Record 1’ and ‘Record 0001’ refer to the same object:
>>> record_with_zeros = 'ABC 001 DEF 2 GHI 03'
>>> MatchBlock.strip_zeros(record_with_zeros)
'ABC 1 DEF 2 GHI 3'
Checking if a word is an abbreviation of another word
>>> MatchBlock.is_abbreviation('Federal Bureau of Investigation', 'FBI')
True
Replacing words with their standardised forms
When integrating data coming from different sources, some linguistic, spelling or conventional inconsistencies are likely to occur. Such situation often takes place when data contain cardinal directions. For example, it probably makes sense to standardize vest, w, as well as zapad (rus.) or ouest (fr.) as west. The package uses a predefined set of standardised forms, see the documentation of MatchBlock.dict_sub to learn how to provide a user-defined one.
>>> MatchBlock.dict_sub("there's a feeling I get when I look to the W")
"there's a feeling I get when I look to the west"
Moving elements of a string
Matchtools package contains functions to move text elements to the beginning or the end of a string. You can specify the element to move by its name or position:
>>> move_element_to_front('London E1 United Kingdom', 1)
'E1 London United Kingdom'
>>> move_element_to_back('London E1 United Kingdom', 'E1')
'London United Kingdom E1'
Example: standardising names in pandas DataFrame
from matchtools import MatchBlock, move_element_to_back
import pandas as pd
input_data = [('nord IV N1'), ('west 3 W001'), ('e 02 E01'), ('sud 1 S1')]
df = pd.DataFrame(input_data, columns = ['Name'])
def standardize(element):
element = move_element_to_back(element, 1)
element = MatchBlock.dict_sub(element)
element = MatchBlock.roman_to_integers(element)
element = MatchBlock.strip_zeros(element)
return element
df['Standardized'] = df.apply(lambda row: standardize(row['Name']), axis=1)
print(df)
Name Standardized
0 nord IV N1 north N1 4
1 west 3 W001 west W1 3
2 e 02 E01 east E1 2
3 sud 1 S1 south S1 1
2. Compare single records¶
Specifying tolerance
Specifying tolerance for each data type is a crucial part of the process. This is how we define what similarity criteria two MatchBlock objects must fulfil in order to be considered as equal. MatchBlock class allows the following tolerances:
property name | description |
---|---|
number_tolerance | expressed in numbers. No maximum value. Default: 0. |
date_tolerance | expressed in numbers (days). No maximum value. Default: 0. |
coordinates_tolerance | expressed in numbers (kilometers, see the documentation MatchBlock.compare_coordinates to learn how to use different units. No maximum value. Default: 0. |
string_tolerance | expressed in numbers (Levenshtein distance when calculating uwratio from fuzzywuzzy package, see the documentation of MatchBlock.compare_strings to learn how to use different algorithms). Maximum value: 100. Default: 0 |
str_number_tolerance | Same as string_tolerance. Used only for the numeric components of a string |
Example 1: Comparing single MatchBlock objects
This is a basic example of matchtool’s main functionality. It shows how to determine whether two string objects are the same, given the tolerances specified.
Note
Comparing two MatchBlock objects triggers the following data manipulation methods on both of them, there’s no need to execute them before the comparison: roman_to_integers, strip_zeros, is_abbreviation, dict_sub.
>>> object1 = MatchBlock('WOJCIOW 11 DEV 07-NOV-86')
>>> object2 = MatchBlock('WOJCIOW 12 DEV 01-NOV-86')
>>> object1
<MatchBlock object at 0x105d43080: date: 1986-11-07, string: WOJCIOW DEV, string (number part): 11>
>>> object2
<MatchBlock object at 0x1063c1e80: date: 1986-11-01, string: WOJCIOW DEV, string (number part): 12>
We created two MatchBlock objects. We can see how the input string has been split into date, text and text-number components. Now, let’s set some tolerance values and perform a comparison:
>>> MatchBlock.date_tolerance = 7
>>> MatchBlock.number_tolerance = 0
>>> MatchBlock.str_number_tolerance = 0
>>> object1 == object2
False
We can see that the objects are considered as different. While the tolerance set for dates is probably high enough it looks that there is still too much difference in the numeric components of the objects:
>>> MatchBlock.number_tolerance = 1
>>> object1 == object2
False
Still false. This is because the numeric parts of the objects come from a string, not an integer or float. Therefore we need to specify the str_number_tolerance appropriately. A thing to remember, str_number_tolerance is a Levenshtein distance tolerance. That’s why setting it to 1 wouldn’t be enough in this case. We use number_tolerance when working with numbers that are not extracted from strings and that tolerance is simply a distance between numbers in integers. The next section includes such objects.
>>> MatchBlock.str_number_tolerance = 50
>>> object1 == object2
True
Example 2: Comparing two lists
In a real work situation you will probably want to perform more complex analysis. For example, you may want to determine whether a record from Database A is equal to a record from Database B. This can be achieved with match_rows function.
>>> record_1 = ['London 1', 5, '1 May 2015']
>>> record_2 = ['London_01', 10, '2015-05-01']
>>> MatchBlock.number_tolerance = 10
>>> match_rows(record_1, record_2)
True
3. Compare multiple records¶
Matchtools include two functions to perform matching on a list of records:
- match_find takes an input record, compares it to a set of records and returns the first matching object
- match_find_all does the same but returns a list of all matching objects
from matchtools import MatchBlock, match_find, match_find_all
MatchBlock.number_tolerance = 10
MatchBlock.date_tolerance = 5
MatchBlock.coordinates_tolerance = 0
MatchBlock.string_tolerance = 0
MatchBlock.str_number_tolerance = 0
record_1 = ['Flight 3', 5, '1 May 2015', '52.3740300, 4.8896900']
records = [['Flight 1', 0, '3 May 2015', '52.3740300, 4.8896900'],
['Flight 2', 5, '4 May 2016', '52.3740300, 4.8896900'],
['Flight 3', 10, '5 May 2015', '52.3740300, 4.8896900'],
['Flight 3', 15, '6 May 2015', '52.3740300, 4.8896900']]
>>> match_find(record_1, records)
['Flight 3', 10, '5 May 2015', '52.3740300, 4.8896900']
>>> match_find_all(record_1, records)
[['Flight 3', 10, '5 May 2015', '52.3740300, 4.8896900'], ['Flight 3', 15, '6 May 2015', '52.3740300, 4.8896900']]