Name matching using FuzzyWuzzy in Python

Introduction

This document explains how to match names using FuzzyWuzzy in Python.

FuzzyWuzzy is a Python library for fuzzy string matching. It can be used to match the well names used in Corva against the names used in a customer’s system.

An example of mismatched names: “J. Leviathon 19-18HC #1 ALT” and “J Leviathon 19-18HC No. 1-ALT”. FuzzyWuzzy helps identify that these two names refer to the same well.
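
To give a feel for how a similarity score separates these names, here is a minimal sketch. It uses the standard library's difflib as a stand-in, not FuzzyWuzzy's own scorers, so the exact numbers will differ from what FuzzyWuzzy reports. The helper name `similarity` is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score (difflib stand-in, not FuzzyWuzzy's scorer)."""
    return 100 * SequenceMatcher(None, a, b).ratio()


corva_name = "J Leviathon 19-18HC No. 1-ALT"

# Same well, written differently in the customer's system
same_well = similarity(corva_name, "J. Leviathon 19-18HC #1 ALT")
# A different well
other_well = similarity(corva_name, "Blanco 777H")

print(f"same well: {same_well:.0f}, other well: {other_well:.0f}")
```

The two spellings of the same well score far higher than the unrelated name, which is the property the matching function relies on.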

To use the fuzzywuzzy library, add it to requirements.txt:

fuzzywuzzy~=0.18.0

The function below takes two arguments:

  • a pattern string (Corva’s well name)

  • a list of candidates for matching (customer’s well names)

The return value is the index of the candidate that best matches the pattern.

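from typing import List

from corva import Logger  # assumption: Logger is the Corva SDK logger
from fuzzywuzzy import process


def name_matching(name: str, candidates: List[str]) -> int:
    # Each item is a (candidate, score) pair; scores range from 0 to 100
    ratios = list(process.extractWithoutOrder(name, candidates))
    highest, highest_ratio = ratios[0]
    highest_ratio_index = 0
    # Find the first candidate with the highest score
    for i in range(1, len(ratios)):
        if ratios[i][1] > highest_ratio:
            highest_ratio_index = i
            highest, highest_ratio = ratios[i]

    # A low score means even the best candidate may be a different well
    if highest_ratio <= 67:
        Logger.info(
            f"Warning: The similarity score between '{name}' "
            f"and '{highest}' is <= 67: {highest_ratio}"
        )
    return highest_ratio_index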

Name matching usage

An example of name_matching usage:

corva_well_name = "J Leviathon 19-18HC No. 1-ALT" 
# data for multiple wells
customer_data = [
    {
        "id": 433,
        "name": "Blanco 777H",
        "date_created": "2020-05-10T18:11:42Z",
        "data": {}
    },
    {
        "id": 843,
        "name": "White 1",
        "date_created": "2021-06-11T18:09:17Z",
        "data": {}
    },
    {
        "id": 845,
        "name": "J. Leviathon 19-18HC #1 ALT",
        "date_created": "2021-06-11T18:09:17.644528Z",
        "data": {}
    }
]
# create a list of candidates
candidates = [item["name"] for item in customer_data]
# find the index of the best match
best_match_index = name_matching(corva_well_name, candidates)
# extract data for the target well
data = customer_data[best_match_index]["data"]

Name matching function

This function can also be used to match other entities (e.g., names of fields or properties).
It logs a warning if the similarity score of the best match is 67 or lower, which suggests that the best match may not refer to the same entity as the pattern.
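
Because the function always returns an index, callers may want to handle low scores themselves rather than rely on the log. The sketch below is one possible variant, not part of the original function: `best_match`, `difflib_score`, and the `scorer` and `threshold` parameters are hypothetical names. The default scorer uses the standard library's difflib as a stand-in; a FuzzyWuzzy scorer such as `fuzz.WRatio` could be passed instead, since it also takes two strings and returns a score from 0 to 100.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


def difflib_score(a: str, b: str) -> float:
    """Stand-in scorer on a 0-100 scale (not FuzzyWuzzy's own scoring)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(
    name: str,
    candidates: List[str],
    scorer: Callable[[str, str], float] = difflib_score,
    threshold: float = 67,
) -> Optional[Tuple[int, float]]:
    """Return (index, score) of the best candidate, or None if no score exceeds the threshold."""
    if not candidates:
        return None
    scores = [scorer(name, candidate) for candidate in candidates]
    # max() returns the first index holding the highest score
    index = max(range(len(scores)), key=scores.__getitem__)
    if scores[index] <= threshold:
        return None
    return index, scores[index]


candidates = ["Blanco 777H", "White 1", "J. Leviathon 19-18HC #1 ALT"]
result = best_match("J Leviathon 19-18HC No. 1-ALT", candidates)
if result is None:
    print("No confident match found")
else:
    print(f"Best match: {candidates[result[0]]}")
```

Returning None for weak matches forces the caller to decide what to do, for example skipping the well or flagging it for manual review.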