*Note: This is the third in a series of blog posts about the computational analysis of survey questions and responses. You’re encouraged to read the other posts for greater context.*
Part I: Writing Open-Ended Survey Questions for Computational Analysis
Part II: Preparing Text for Analysis with Natural Language Toolkit (NLTK)

Get Rid of Spam and Identify Common Responses

There are several reasons why you may wish to begin an analysis by finding near duplicate text. For instance, finding duplicates can be a low-effort way of weeding out spam or form-letter responses. Alternatively, you may want to find the most common responses.

Our survey client wanted to know which competitor brands participants listed most frequently. Fortunately, the survey provided a discrete field for listing each brand name. Unfortunately, most people are terrible spellers. Whether through ignorance or apathy, participants mangled most brand names, including our client’s! Given how common these misspellings were, we wanted to group each brand name together with its frequent misspellings.

We’ll show you two ways to find near duplicate text: one method for individual tokens (i.e., words) and another for documents. We’ll start with the technique for detecting near duplicate tokens, since it corresponds to our task of grouping and counting misspelled brand names.

Finding Near Duplicate Tokens

Remember that text cleaning is about standardization. In other words, we want to reduce the variation in our tokens so we can group and count similar items. We can use simple Python methods to do this. The clean_text() function below uses standard Python string methods to remove punctuation, convert text to lowercase, and strip leading and trailing whitespace. Use it as a starting point for standardizing one- or two-word strings, such as brand names.

#cleaning text to make it amenable to analysis 
import string 

def clean_text(txt):
    """
    Removes leading & trailing whitespace, formats to lowercase, removes
    punctuation.  
    
    Args: 
        txt(str): string of text to be formatted. 
    
    Returns: 
        (str): formatted string of text. 
    """
    table = str.maketrans({ch: '' for ch in string.punctuation}) 
    return txt.translate(table).lower().strip() 

# some example data
examples = ["Brand 1", "      Brand 1.  ", "#bR@aND 1!", "brand 2",
           "gizmoA", "   Giz@mo b", "gi!zmo", "g1m#o$ c!", "gizmo",
           "widgetA123", 'widg', '2jwidget', 'widget', 'widget'
           ]

print("Cleans first three elements:")
[print("Raw Text: ", i) for i in examples[:3]]
print("\n")
[print("Clean Text: ", clean_text(i)) for i in examples[:3]]
Cleans first three elements:
Raw Text:  Brand 1
Raw Text:        Brand 1.  
Raw Text:  #bR@aND 1!

Clean Text:  brand 1
Clean Text:  brand 1
Clean Text:  brand 1

Next, we’ll define a function, get_near_duplicate_keys(), to help us find the duplicates. Its first argument is a Python dictionary, created with the Counter class, that maps each string (i.e., brand name) to its frequency. We assume that the most frequently occurring spelling is correct, which is generally a good heuristic. The second argument is a similarity threshold between 0 (lowest) and 1 (highest). You should experiment with this value to achieve good performance on your own data. The function returns a dictionary that maps each presumed correct spelling to a list of its near matches.

Now, let’s discuss what this function does. First, it creates a complete list of our tokens (i.e., brand names) and iterates over them. The first token is, by definition, new, because the function hasn’t seen any tokens yet. Therefore, it adds this token to a dictionary as a key with an empty list as the value. On the next iteration, a similarity score is calculated between the second token and the first. If the computed score exceeds our threshold value, we append the second token to the first token’s list. Otherwise, we add the second token to our dictionary as a key with an empty list as a value. This continues until we have iterated through the entire list of tokens. Although this approach is not particularly computationally efficient, it’s fine for survey data, which is simply not that large.

Our function returns a dictionary with our best guess at the correctly spelled brand name as each key and a list of its near matches as the value. By referring back to our original counter dictionary, we can sum the counts of all near matches to find out which brands were mentioned most frequently.
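
To build intuition for the similarity score, here’s a minimal sketch using the same difflib.SequenceMatcher call our function relies on, applied to a few of the cleaned example tokens from above.

import difflib

# ratio() returns 2*M/T, where M is the number of matching characters and
# T is the total number of characters in both strings.
for a, b in [("gizmo", "gizmoa"), ("brand 1", "brand 2"), ("widget", "widg")]:
    score = difflib.SequenceMatcher(None, a, b).ratio()
    print(a, "vs.", b, ":", round(score, 2))

Note that "brand 1" and "brand 2" score roughly 0.86, well above the 0.50 threshold used in the demonstration below, which is why the output groups "brand 2" under "brand 1". Genuinely distinct names can collide at a low threshold, so experiment before settling on a value.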

import difflib
from collections import Counter


def get_near_duplicate_keys(d, threshold=0.8):
    """
    Args:
        d(dict): dictionary of words sorted by counts (i.e., the value).
        threshold(flt): similarity threshold between 0 and 1.

    Returns:
        key_mapping(dict): maps the key with the highest count to a list of
                        near duplicate keys.
    """

    d.pop('nan', None)  # deletes "nan", which represents no input in pandas.

    key_mapping = {}            # initializing an empty dict for our keys.
    key_list = list(d.keys())   # making a list of dict keys.

    for i in key_list:  # iterating over keys.
        if i not in key_mapping:
            matched = False  # if key is not in dict, there is no match yet.
            for k in key_mapping:
                score = difflib.SequenceMatcher(None, i, k).ratio()  # compare the key to previous keys.
                if score >= threshold:
                    key_mapping[k].append(i)  # if similarity exceeds threshold, map it to the existing key.
                    matched = True
                    break
            if not matched:
                key_mapping[i] = []  # since the dict is ordered by frequency count, this first occurrence becomes the canonical spelling.

    return key_mapping

# Cleaning our examples
clean_examples = [clean_text(t) for t in examples]

# Counting up our examples
counted_examples = Counter(clean_examples)
print("***Counted Example Tokens***")
for k, v in counted_examples.items():
    print(k, ":", v)
print('\n')

# Sorting our counter, as the most common spelling will be considered correct
counted_examples = dict(counted_examples.most_common())

# Finding our near duplicates
near_dupes = get_near_duplicate_keys(counted_examples, threshold=0.50)  # run function on dict of brand names.
print("***Near Duplicates***")
for k, v in near_dupes.items():
    print(k, ":", v)

# Counting up our near duplicates
counted_examples_collapsed = {}
for k, v in near_dupes.items():
    counted_examples_collapsed[k] = counted_examples[k]
    for i in v:
        counted_examples_collapsed[k] += counted_examples[i]

print('\n')
print('***Counted Examples Collapsed w/ Near Duplicates***')
for k, v in counted_examples_collapsed.items():
    print(k, ":", v)
***Counted Example Tokens***
brand 1 : 3
brand 2 : 1
gizmoa : 1
gizmo b : 1
gizmo : 2
g1mo c : 1
widgeta123 : 1
widg : 1
2jwidget : 1
widget : 2

***Near Duplicates***
brand 1 : ['brand 2']
gizmo : ['gizmoa', 'gizmo b', 'g1mo c']
widget : ['widgeta123', 'widg', '2jwidget']

***Counted Examples Collapsed w/ Near Duplicates***
brand 1 : 4
gizmo : 5
widget : 5

Calculating Document Similarity

During qualitative inspection of your data, you may want to view survey responses that resemble one another. Alternatively, you may want to reduce the size of your data by filtering out redundant responses. Either way, you’ll need to calculate document similarity. The general approach is as follows:

  • First, the documents are tokenized. If you’re a bit fuzzy on tokenization, read our previous blog post explaining the topic.
  • Next, we will vectorize each document. This can be as simple as counting, for each token in our corpus, how many times it appears in a given document. A better option is Term Frequency-Inverse Document Frequency (TF-IDF), which weights a term’s occurrence frequency by the inverse of its document frequency (i.e., the number of documents containing the term). As a result, TF-IDF assigns more weight to the important and distinctive words in a document.
  • Finally, we compute a similarity metric between each pair of document vectors to create a matrix of similarity scores. A solid choice is cosine similarity. Without getting too far into the weeds, cosine similarity scores for these vectors range between 0 (lowest) and 1 (highest), and the measure is conceptually related to the Pearson correlation. (A small worked example follows this list.)
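
Before handing this work to a library, it helps to see the arithmetic. The sketch below is a toy illustration rather than part of our survey pipeline: the two count vectors and their four-token vocabulary are made up for this example.

import numpy as np

# Toy count vectors for two short documents over a shared four-token vocabulary.
doc_a = np.array([2, 1, 0, 1])
doc_b = np.array([1, 0, 1, 1])

# Cosine similarity: dot product divided by the product of the vector lengths.
cos_sim = doc_a.dot(doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(round(cos_sim, 3))  # prints 0.707

With TF-IDF, each count is re-weighted by the inverse document frequency before the same cosine calculation is applied; scikit-learn’s TfidfVectorizer and cosine_similarity functions, used below, handle both steps for us.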

If this sounds like a lot of work, fortunately scikit-learn contains most of what we need, beginning with the TfidfVectorizer class. Calling its .fit_transform() method on a corpus first tokenizes the documents. TfidfVectorizer also lets you specify the tokenizer to use; in practice, we pass in the text cleaning functions defined in Part II of this series, “Preparing Text for Analysis with Natural Language Toolkit (NLTK).”

Next, it vectorizes the documents using TF-IDF scores and outputs a document-by-term (i.e., token) matrix. You can then pass this matrix to scikit-learn’s cosine_similarity function to calculate similarity scores.
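
As an aside, here’s what passing a custom tokenizer might look like. This is a sketch, not our production setup: clean_and_tokenize() is a hypothetical stand-in for the cleaning and tokenizing helpers from Part II, and setting token_pattern=None tells recent versions of scikit-learn to rely entirely on the supplied tokenizer rather than its default regular expression.

from sklearn.feature_extraction.text import TfidfVectorizer

def clean_and_tokenize(txt):
    # Hypothetical helper: strip punctuation, lowercase, and split on whitespace.
    return clean_text(txt).split()

custom_vectorizer = TfidfVectorizer(tokenizer=clean_and_tokenize, token_pattern=None)

In the example that follows, we stick with the vectorizer’s defaults.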

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus for our example
doc_examples = ['We use data, statistical algorithms, and machine learning to help you make business decisions and targeted digital marketing efforts based on potential outcomes.', 
                'With propensity modeling, you can predict the likelihood of a visitor, lead, or current customer to perform a certain action on your website (i.e. browse your site, click a CTA, pick up their phone to call).',
                'Once you can anticipate future customer and user behavior, you can plan for possible challenges and obstacles you’ll need to help that customer or user overcome.', 
                "Evolytics can help you with your data science needs.", 
                "Evolytics can help you with data visualization.",
                "Evolytics can help you with your data analysis."
               ]

vectorizer = TfidfVectorizer()
doc_term_matrix = vectorizer.fit_transform(doc_examples)

print("Number of Documents: ", doc_term_matrix.shape[0]) 
print("Number of Unique Tokens: ", doc_term_matrix.shape[1])
print("Shape of Document X Term Matrix: ", doc_term_matrix.shape)
print("\n")

print('**** Cosine Similarity Matrix ****')
print(cosine_similarity(doc_term_matrix))
Number of Documents:  6
Number of Unique Tokens:  68
Shape of Document X Term Matrix:  (6, 68)

[[1.         0.07513442 0.17156885 0.08527618 0.10476091 0.09706002]
 [0.07513442 1.         0.14814629 0.15789534 0.08892584 0.17971401]
 [0.17156885 0.14814629 1.         0.12761192 0.15676993 0.1452459 ]
 [0.08527618 0.15789534 0.12761192 1.         0.53381905 0.61900007]
 [0.10476091 0.08892584 0.15676993 0.53381905 1.         0.60758453]
 [0.09706002 0.17971401 0.1452459  0.61900007 0.60758453 1.        ]]

Now we have a matrix of similarity scores, but we still need to retrieve similar documents! First, replace all values along the matrix diagonal with zero, because these values represent a comparison of a document to itself and will, therefore, always be 1. Next, for each row, use NumPy’s argwhere() function, which returns the indices of documents whose similarity score exceeds our chosen threshold. A pair of simple functions is defined below:

# Finding matches
import numpy as np

cos_sim_matrix = cosine_similarity(doc_term_matrix)

def get_indices(m, i, threshold):
    # Returns the column indices in row i whose similarity exceeds the threshold.
    row = m[i]
    return [idx[0] for idx in np.argwhere(row > threshold)]


def get_matches(m, index=None, threshold=.5):

    np.fill_diagonal(m, 0)  # zero out self-comparisons (modifies m in place).

    if index is not None:  # matches for a single document.
        return {f"Document {index}": get_indices(m, index, threshold)}

    else:  # matches for every document.
        num_rows = m.shape[0]
        docs = {}
        for j in range(num_rows):
            doc = f"Document {j}"
            indices = get_indices(m, j, threshold)
            docs[doc] = indices

        return docs

print("***Matches for a Specific Document***")
print(get_matches(cos_sim_matrix, index = 4))
print('\n')

print("***Matches for All Documents***")
matches = get_matches(cos_sim_matrix)
print(matches)
print('\n')

print("***Example of Matching Documents***")
print(doc_examples[4])
for i in matches["Document 4"]: 
    print(doc_examples[i]) 
***Matches for a Specific Document***
{'Document 4': [3, 5]}

***Matches for All Documents***
{'Document 0': [], 'Document 1': [], 'Document 2': [], 'Document 3': [4, 5], 'Document 4': [3, 5], 'Document 5': [3, 4]}

***Example of Matching Documents***
Evolytics can help you with data visualization.
Evolytics can help you with your data science needs.
Evolytics can help you with your data analysis.
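
If your goal is the deduplication mentioned earlier rather than qualitative inspection, the matches dictionary can drive a simple filter. The sketch below is one possible approach, not part of the original analysis: keep the first document in each group of near duplicates and drop the rest.

# Keep one representative from each group of near-duplicate documents.
to_drop = set()
for j in range(len(doc_examples)):
    if j in to_drop:
        continue  # already marked as a near duplicate of an earlier document.
    for idx in matches[f"Document {j}"]:
        if idx > j:
            to_drop.add(idx)

deduped_examples = [doc for j, doc in enumerate(doc_examples) if j not in to_drop]
print(len(doc_examples), "documents reduced to", len(deduped_examples))

With the toy corpus above, this keeps the first of the three Evolytics sentences and reduces six documents to four.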

Named Entity Recognition

The final item to discuss is named entity recognition. Our survey client had only one response field for customers to explain their ratings of multiple competitor brands. The challenge was figuring out how to attribute each sentence in the rating explanation to a given brand. We used the similarity methods above to iterate through the sentences in each response, find brand names, and make informed guesses about attribution. But what would you do without a list of the brands or products you want to find?

Named entity recognition (NER) uses part-of-speech tagging to detect proper names, places, organizations, products, and brands by examining the structure of a token in relation to those around it. In other words, NER can identify entities that are likely brands or products even when you don’t have a list of them in advance. Use the code snippet below to extract named entities from an example bio paragraph.

# Code adapted from https://stackoverflow.com/questions/24398536/named-entity-recognition-with-regular-expression-nltk

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if type(i) == Tree:
            # Collect the tokens of this named-entity subtree.
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # A non-entity token ends the current chunk.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []
        else:
            continue
    if current_chunk:
        # Handle an entity that ends the text.
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

text = "Dr. Sanders taught communication technologies, marketing " \
       "research and digital marketing at the University of Louisville. " \
       "Currently he works at Evolytics in Parkville, Missouri. His wife, " \
       "Anna, is a resident ER physician at University Health in Kansas City, "\
       "Missouri."

get_continuous_chunks(text)
['Sanders',
 'University',
 'Louisville',
 'Evolytics',
 'Parkville',
 'Missouri',
 'Anna',
 'University Health',
 'Kansas City']

Conclusion

This blog post demonstrates how we solved our survey client’s problem, which involved counting the misspelled competitor brands consumers listed in an open-response survey. We also covered how to retrieve similar survey responses by calculating document similarity. Finally, we illustrated how to use named entity recognition to create a list of the people, brands, products, and organizations discussed in a text.

Written By


Scott Sanders, Ph.D.

Scott Sanders, Ph.D., Senior Manager of Data Science, uses experimentation and computational methods to help clients make data driven choices. He has published research on brand/consumer engagement via social media, user evaluation of recommender systems, and the factors that determine price premiums in online auctions.