API

class matchtools.MatchBlock(entry, *, try_date=True, try_coordinates=True, try_str_number=True, try_str_custom=True, convert_roman=True)

Core class that contains all methods for data extraction, processing, and comparison.

classmethod compare_coordinates(coords1, coords2, *args, tolerance=None, unit='km', **kwargs)

Check whether the distance between two pairs of coordinates is within the specified tolerance.

Return True if yes, otherwise return False.

Use geopy (https://pypi.python.org/pypi/geopy).

Try the Vincenty formula first; if an error occurs, fall back to the great-circle formula.

Parameters:
  • coords1 – pair of coordinates - a tuple of two numbers
  • coords2 – pair of coordinates - a tuple of two numbers
  • tolerance – number
  • unit – str, one of: ‘kilometers’, ‘km’, ‘meters’, ‘m’, ‘miles’, ‘mi’, ‘feet’, ‘ft’, ‘nautical’, ‘nm’
Return type:

bool

Example:
>>> a, b = (36.1332600, -5.4505100), (35.8893300, -5.3197900)
>>> MatchBlock.compare_coordinates(a, b, tolerance=20, unit='mi')
True
>>> a, b = (36.1332600, -5.4505100), (35.8893300, -5.3197900)
>>> MatchBlock.compare_coordinates(a, b, tolerance=1, unit='mi')
False
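The great-circle fallback and the unit handling can be illustrated without geopy. The following is a minimal standard-library sketch, not the library's implementation: the names `KM_PER_UNIT`, `great_circle_km` and `within_distance` are hypothetical, and the haversine formula on a spherical Earth stands in for geopy's distance classes.

```python
import math

# Kilometres per unit; the unit names mirror the ones listed for `unit`.
KM_PER_UNIT = {
    "kilometers": 1.0, "km": 1.0,
    "meters": 0.001, "m": 0.001,
    "miles": 1.609344, "mi": 1.609344,
    "feet": 0.0003048, "ft": 0.0003048,
    "nautical": 1.852, "nm": 1.852,
}


def great_circle_km(coords1, coords2, radius_km=6371.0088):
    """Haversine distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, coords1)
    lat2, lon2 = map(math.radians, coords2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def within_distance(coords1, coords2, tolerance, unit="km"):
    """Return True if the two points are at most `tolerance` units apart."""
    return great_circle_km(coords1, coords2) / KM_PER_UNIT[unit] <= tolerance
```

Because the sketch assumes a sphere, its distances differ slightly from ellipsoidal (Vincenty) results, which only matters when the tolerance is very close to the actual distance.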
classmethod compare_dates(date1, date2, *, tolerance=None, pattern='%d-%b-%Y')

Check if the dates provided are within the specified tolerance.

Return True if yes, otherwise return False.

If lists of dates are provided, first check that they have the same length.

If they do, check whether the difference between each element of date1 and the corresponding element of date2 is within the specified tolerance.

Parameters:
  • date1 – datetime.datetime object or a list of such objects
  • date2 – datetime.datetime object or a list of such objects
  • tolerance – number
  • pattern – str
Return type:

bool

Example:
>>> date1 = datetime.datetime(2005, 5, 25, 0, 0)
>>> date2 = datetime.datetime(2005, 5, 26, 0, 0)
>>> MatchBlock.compare_dates(date1, date2, tolerance=1)
True
>>> date1 = [datetime.datetime(2005, 5, 26, 0, 0)]
>>> date2 = [datetime.datetime(2006, 5, 25, 0, 0)]
>>> MatchBlock.compare_dates(date1, date2, tolerance=1)
False
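The single-date and list cases described above can be sketched as follows. This is an illustration only: `dates_within` is a hypothetical name, the tolerance is assumed to be measured in days (which is consistent with the examples but not stated in the source), and a list compared with a non-list is assumed to be a mismatch.

```python
import datetime


def dates_within(date1, date2, tolerance_days=0):
    """Compare two datetimes, or two equal-length lists of datetimes."""
    if isinstance(date1, list) or isinstance(date2, list):
        # Assumption: a list can only match another list of the same length.
        if not (isinstance(date1, list) and isinstance(date2, list)):
            return False
        if len(date1) != len(date2):
            return False
        # Compare element by element; every pair must be within tolerance.
        return all(dates_within(d1, d2, tolerance_days)
                   for d1, d2 in zip(date1, date2))
    return abs(date1 - date2) <= datetime.timedelta(days=tolerance_days)
```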
classmethod compare_numbers(number1, number2, *, tolerance=None)

Check if the numbers provided are within the specified tolerance.

Return True if yes, otherwise return False.

Parameters:
  • number1 – number
  • number2 – number
  • tolerance – number
Return type:

bool

Example:
>>> MatchBlock.compare_numbers(1, 10, tolerance=10)
True
>>> MatchBlock.compare_numbers(1, 10, tolerance=5)
False
classmethod compare_strings(string1, string2, *, tolerance=None, method='uwratio')

Check if the strings provided have a similarity ratio within the specified tolerance.

Return True if yes, otherwise return False.

Use fuzzywuzzy (https://pypi.python.org/pypi/fuzzywuzzy).

Parameters:
  • string1 – str
  • string2 – str
  • tolerance – number
  • method – str, one of: ‘uwratio’, ‘partial_ratio’, ‘token_sort_ratio’, ‘token_set_ratio’, ‘ratio’
Return type:

bool

Example:
>>> MatchBlock.compare_strings('Beatles', 'The Beatles', tolerance=10)
True
>>> MatchBlock.compare_strings('AB', 'AC', tolerance=0, method='ratio')
False
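The examples suggest that the tolerance is the allowed shortfall from a perfect score of 100. The sketch below illustrates that reading with the standard library's `difflib` in place of fuzzywuzzy; `strings_within` is a hypothetical name, and `difflib` scores are not the same as fuzzywuzzy's, so the 'Beatles' example above is not expected to reproduce with it.

```python
from difflib import SequenceMatcher


def strings_within(string1, string2, tolerance=0):
    """Return True if the similarity score (0-100) is within `tolerance` of 100."""
    score = SequenceMatcher(None, string1, string2).ratio() * 100
    return 100 - score <= tolerance
```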
classmethod dict_sub(string, dictionary_file=None)

Substitute string values with values from a dictionary.

Split the string on non-alphanumeric characters and replace each part that appears among the values listed under a dictionary key with that key.

The dictionary must be stored in JSON format. By default, the file provided with the package is used.

Parameters:
  • string – str
  • dictionary_file – str
Return type:

str

Example:
>>> MatchBlock.dict_sub('S Africa')
'south Africa'
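The substitution can be pictured with a small in-memory dictionary. In this sketch the layout of `SAMPLE_DICTIONARY` (each key mapped to a list of variants) and the name `substitute` are assumptions made for illustration; the structure of the JSON file shipped with the package is not shown in this document.

```python
import re

# Hypothetical layout: each key maps to the variants it should replace.
SAMPLE_DICTIONARY = {"south": ["s", "sud"], "west": ["w", "ouest"]}


def substitute(string, dictionary=SAMPLE_DICTIONARY):
    """Replace each alphanumeric part that is a known variant with its key."""
    lookup = {variant: key
              for key, variants in dictionary.items()
              for variant in variants}
    # Keep separators as their own parts so they survive unchanged.
    parts = re.findall(r"[^\W_]+|[\W_]+", string)
    return "".join(lookup.get(part.lower(), part) for part in parts)
```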
classmethod extract_coordinates(string)

Extract a pair of coordinates (latitude and longitude, separated by a comma) from a string.

If found, return the remainder of the original string and a tuple with the coordinates; otherwise return the original string and None.

Parameters:
  • string – str
Return type:

tuple

Example:
>>> MatchBlock.extract_coordinates('Washington 38.8897, -77.0089')
('Washington', (38.8897, -77.0089))
>>> MatchBlock.extract_coordinates('55.7522200,37.6155600')
('', (55.75222, 37.61556))
>>> MatchBlock.extract_coordinates('Richmond 123.45')
('Richmond 123.45', None)
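One way to get the behaviour shown in these examples is a regular expression for two decimal numbers separated by a comma. This is a sketch under that assumption; `COORD_PAIR` and `find_coordinates` are hypothetical names and the library's actual pattern may differ.

```python
import re

# Two decimal numbers (optionally negative) separated by a comma.
COORD_PAIR = re.compile(r"(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")


def find_coordinates(string):
    """Return (remaining string, (lat, lon)) or (original string, None)."""
    match = COORD_PAIR.search(string)
    if match is None:
        return string, None
    # Cut the matched pair out of the string and tidy the whitespace.
    remainder = (string[:match.start()] + string[match.end():]).strip()
    return remainder, (float(match.group(1)), float(match.group(2)))
```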
classmethod extract_dates(string)

Extract dates from text and return the remaining string together with a list of the datetime.datetime objects found.

Use datefinder (https://pypi.python.org/pypi/datefinder).

Parameters:
  • string – str
Return type:

tuple

Example:
>>> MatchBlock.extract_dates('Istanbul 25 May 2005 ')
('Istanbul', [datetime.datetime(2005, 5, 25, 0, 0)])
classmethod extract_str_custom(string, dictionary_file=None)

Extract all custom values found in a string.

First prepare the string with dict_sub, which replaces any variant found among the dictionary’s values with the corresponding key. Then look up the parts of the string among the dictionary’s keys.

Return the remaining string and the extracted custom parts.

Parameters:
  • string – str
  • dictionary_file – str
Return type:

tuple

Example:
>>> MatchBlock.extract_str_custom('East Timor')
('Timor', 'east')
>>> MatchBlock.extract_str_custom('Sud Ouest France')
('France', 'south west')
classmethod extract_str_number(string)

Extract all numeric elements found in a string. An element is considered numeric if it contains at least one digit.

Return the remaining string and the extracted numeric parts.

Parameters:
  • string – str
Return type:

tuple

Example:
>>> MatchBlock.extract_str_number('Jamaica 1')
('Jamaica', '1')
>>> MatchBlock.extract_str_number('Jamaica 1X')
('Jamaica', '1X')
>>> MatchBlock.extract_str_number('Jamaica')
('Jamaica', '')
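The "at least one digit" rule can be sketched as a simple partition of whitespace-separated tokens. `split_numeric` is a hypothetical name, and splitting on whitespace (rather than on every non-alphanumeric character) is an assumption made to keep the example short.

```python
def split_numeric(string):
    """Separate tokens containing a digit from the rest of the string."""
    words, numeric = [], []
    for token in string.split():
        # A token counts as numeric as soon as it contains one digit.
        if any(ch.isdigit() for ch in token):
            numeric.append(token)
        else:
            words.append(token)
    return " ".join(words), " ".join(numeric)
```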
classmethod from_roman(string)

Convert the whole input string from a Roman numeral to an Arabic numeral.

Use roman (https://pypi.python.org/pypi/roman).

Return the result if the conversion was successful; otherwise return the input unchanged.

Parameters:
  • string – str
Return type:

str

Example:
>>> MatchBlock.from_roman('VII')
'7'
>>> MatchBlock.from_roman('XIIIIX')
'XIIIIX'
>>> MatchBlock.from_roman('ABC')
'ABC'
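The "convert or return the input" behaviour can be illustrated without the roman package. The sketch below validates the numeral with a strict pattern for the range 1 to 3999 and falls back to the input otherwise; `ROMAN_PATTERN`, `ROMAN_VALUES` and `roman_or_original` are hypothetical names, and the validation rules of the roman package may differ in detail.

```python
import re

# Strict pattern for well-formed numerals between 1 and 3999.
ROMAN_PATTERN = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_or_original(string):
    """Return the Arabic value as a string, or the input if it is not a numeral."""
    if not string or not ROMAN_PATTERN.fullmatch(string):
        return string
    values = [ROMAN_VALUES[ch] for ch in string]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (e.g. IV = 4).
        if i + 1 < len(values) and values[i + 1] > value:
            total -= value
        else:
            total += value
    return str(total)
```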
classmethod integers_to_roman(string)

Convert all integers within the string into Roman numerals.

Only integers delimited by non-alphanumeric characters are recognised.

Parameters:
  • string – str
Return type:

str

Example:
>>> MatchBlock.integers_to_roman('LIV 4 DOR 3')
'LIV IV DOR III'
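A possible way to convert only stand-alone integers is a regular-expression substitution with a callback. In this sketch `NUMERALS`, `to_roman` and `integers_to_numerals` are hypothetical names, and leaving values outside 1 to 3999 untouched is an assumption.

```python
import re

NUMERALS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(number):
    """Convert an integer between 1 and 3999 into a Roman numeral."""
    result = []
    for value, symbol in NUMERALS:
        count, number = divmod(number, value)
        result.append(symbol * count)
    return "".join(result)


def integers_to_numerals(string):
    """Replace digit runs that are not adjacent to letters or digits."""
    def convert(match):
        number = int(match.group())
        # Assumption: values outside 1..3999 are left as they are.
        return to_roman(number) if 0 < number < 4000 else match.group()

    return re.sub(r"(?<![^\W_])\d+(?![^\W_])", convert, string)
```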
classmethod is_abbreviation(string1, string2)

Check whether one string is an abbreviation of the other.

Parameters:
  • string1 – str
  • string2 – str
Return type:

bool

Example:
>>> MatchBlock.is_abbreviation('Federal Bureau of Investigation', 'FBI')
True
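The source does not say how abbreviations are detected. One simple heuristic that reproduces the example is to compare the shorter string with the initials of the capitalised words in the longer one. This is an illustrative guess only; `looks_like_abbreviation` is a hypothetical name and the library may use a different rule.

```python
def looks_like_abbreviation(string1, string2):
    """Check whether the shorter string equals the capitalised initials of the longer."""
    longer, shorter = sorted((string1, string2), key=len, reverse=True)
    # Lower-case words such as 'of' are skipped when collecting initials.
    initials = "".join(word[0] for word in longer.split() if word[0].isupper())
    return len(shorter) > 1 and initials == shorter.replace(".", "").upper()
```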
classmethod roman_to_integers(string)

Convert all Roman numerals within the string into integers.

Only Roman numerals delimited by non-alphanumeric characters are recognised.

Parameters:
  • string – str
Return type:

str

Example:
>>> MatchBlock.roman_to_integers('IV ABC II')
'4 ABC 2'
classmethod split_on_nonalpha(string, return_all=True)

Split the input string into a list of alphanumeric and non-alphanumeric components.

If return_all is False, return only the alphanumeric components of the string.

Parameters:
  • string – str
  • return_all – bool
Return type:

list

Example:
>>> MatchBlock.split_on_nonalpha('F.C. Liverpool', return_all=True)
['F', '.', 'C', '. ', 'Liverpool']
>>> MatchBlock.split_on_nonalpha('F.C. Liverpool', return_all=False)
['F', 'C', 'Liverpool']
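The split shown in these examples can be reproduced with a single regular expression that alternates between runs of alphanumeric and non-alphanumeric characters. This is a sketch rather than the library's code; `split_components` is a hypothetical name, and treating the underscore as non-alphanumeric is an assumption.

```python
import re


def split_components(string, return_all=True):
    """Split into alphanumeric runs, optionally keeping the separators."""
    pattern = r"[^\W_]+|[\W_]+" if return_all else r"[^\W_]+"
    return re.findall(pattern, string)
```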
classmethod strip_zeros(string)

Strip leading zeros in any number longer than one digit found in a string.

Parameters:
  • string – str
Return type:

str

Example:
>>> MatchBlock.strip_zeros('Agent 007')
'Agent 7'
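Stripping leading zeros while leaving a lone '0' intact fits in one substitution. In this sketch `drop_leading_zeros` is a hypothetical name, and only numbers that start at a word boundary are handled, which is an assumption about what counts as a number in the string.

```python
import re


def drop_leading_zeros(string):
    """Remove leading zeros from numbers with more than one digit."""
    # `0+` must be followed by another digit, so a single '0' never matches.
    return re.sub(r"\b0+(\d)", r"\1", string)
```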
matchtools.return_element(word, element)

Split word on non-alphanumeric characters and return the index of element among the resulting parts.

If element occurs more than once in the word, the index of the first occurrence is returned.

Parameters:
  • word – str
  • element – str
Return type:

int

Example:
>>> return_element('South America', 'America')
1
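The lookup amounts to splitting the word into its alphanumeric parts and asking for the position of the element. In this sketch `element_index` is a hypothetical name, and raising ValueError for a missing element is simply what `list.index` does, not documented library behaviour.

```python
import re


def element_index(word, element):
    """Return the position of `element` among the alphanumeric parts of `word`."""
    parts = re.findall(r"[^\W_]+", word)
    # list.index returns the first occurrence and raises ValueError if absent.
    return parts.index(element)
```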
matchtools.match_rows(row1, row2)

Compare rows by transforming each pair of values into MatchBlock objects and performing an equality check on them.

The rows are considered to match if all checks result in True.

Parameters:
  • row1 – list, tuple
  • row2 – list, tuple
Return type:

bool

Example:
>>> row1 = ['Flight 1', 5, '1 May 2015']
>>> row2 = ['Flight 01', 5, '2015-05-01']
>>> match_rows(row1, row2)
True
>>> row1 = ['Flight 2', 5, '1 May 2015']
>>> row2 = ['Flight 02', 6, '2015-05-01']
>>> match_rows(row1, row2)
False
matchtools.match_find(row, rows)

Search a list of rows and return the first row that matches the input row.

Parameters:
  • row – list, tuple
  • rows – nested list, nested tuple
Return type:

list

Example:
>>> row = ['Flight 3', 100]
>>> rows = [['Flight 1', 100], ['Flight 2', 100], ['Flight 3', 100]]
>>> match_find(row, rows)
['Flight 3', 100]
matchtools.match_find_all(row, rows)

Search a list of rows and return all rows that match the input row.

Parameters:
  • row – list, tuple
  • rows – nested list, nested tuple
Return type:

list

Example:
>>> row = ['Flight 2', 100]
>>> rows = [['Flight 1', 100], ['Flight 2', 100], ['Flight 2', 100]]
>>> match_find_all(row, rows)
[['Flight 2', 100], ['Flight 2', 100]]
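The difference between returning the first match and returning all matches can be sketched with a pluggable comparison. Here `find_first`, `find_all` and the `rows_match` argument are hypothetical; in the library the comparison is done by match_rows, and returning None when nothing matches is an assumption.

```python
import operator


def find_first(row, rows, rows_match=operator.eq):
    """Return the first candidate that matches `row`, or None if there is none."""
    return next((candidate for candidate in rows if rows_match(row, candidate)), None)


def find_all(row, rows, rows_match=operator.eq):
    """Return every candidate that matches `row`, in their original order."""
    return [candidate for candidate in rows if rows_match(row, candidate)]
```

Plain equality is used as the default comparison only to keep the sketch self-contained; passing a fuzzy comparison function as `rows_match` gives the behaviour described above.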
matchtools.move_element_to_front(word, element)

Move an element of a word to the front.

If element is a string, return_element is called to determine its position within the word.

If element is an integer, it must be the position of the element within the word split on non-alphanumeric characters, e.g. ‘Block-A 1’ -> [‘Block’, ‘A’, ‘1’].

All sequences of non-alphanumeric characters are converted into single spaces.

Parameters:
  • word – str
  • element – str (gets converted to int by return_element()) or int
Return type:

str

Example:
>>> move_element_to_front('A B C', 2)
'C A B'
matchtools.move_element_to_back(word, element)

Move an element of a word to the back.

If element is a string, return_element is called to determine its position within the word.

If element is an integer, it must be the position of the element within the word split on non-alphanumeric characters, e.g. ‘Block-A 1’ -> [‘Block’, ‘A’, ‘1’].

All sequences of non-alphanumeric characters are converted into single spaces.

Parameters:
  • word – str
  • element – str (gets converted to int by return_element()) or int
Return type:

str

Example:
>>> move_element_to_back('A B C', 0)
'B C A'
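Both move functions follow the same steps: split the word into alphanumeric parts, resolve the element to a position, move that part, and join the parts with single spaces. The sketch below combines them in one hypothetical helper, `move_element`, with a `to_front` flag; it is an illustration of the described behaviour, not the library's code.

```python
import re


def move_element(word, element, to_front=True):
    """Move one alphanumeric part of `word` to the front or to the back."""
    parts = re.findall(r"[^\W_]+", word)
    # A string element is resolved to the index of its first occurrence.
    index = parts.index(element) if isinstance(element, str) else element
    moved = parts.pop(index)
    if to_front:
        parts.insert(0, moved)
    else:
        parts.append(moved)
    # Joining with spaces collapses every separator run into one space.
    return " ".join(parts)
```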