
Comic by xkcd.

tl;dr: Here’s the code

It’s nothing fancy — a couple of Python one-line methods:

word_to_initialism(), which converts a word into an initialism
initialism_to_acronym(), which turns an initialism into an acronym

import re

def word_to_initialism(word):
  """Turns every letter in a given word to an uppercase letter followed by a period.
  
  For example, it turns “goat” into “G.O.A.T.”.
  """
  return re.sub('([a-zA-Z])', '\\1.', word).upper()

def initialism_to_acronym(initialism):
  """Removes the periods from an initialism, turning it into an acronym.

  For example, it turns “N.A.S.A.” into “NASA”.
  """
  return re.sub(r'\.', '', initialism)

The project and its dictionary

I’ve been working on a Python project that makes use of a JSON “dictionary” file of words or phrases and their definitions. Here’s a sample of the first few entries in the file, formatted nicely so that they’re a little more readable:

{
   "abandoned industrial site": [
      "Site that cannot be used for any purpose, being contaminated by pollutants."
   ],

   "abandoned vehicle": [
      "A vehicle that has been discarded in the environment, urban or otherwise, often found wrecked, destroyed, damaged or with a major component part stolen or missing."
   ],

   "abiotic factor": [
      "Physical, chemical and other non-living environmental factor."
   ],

   "access road": [
      "Any street or narrow stretch of paved surface that leads to a specific destination, such as a main highway."
   ],

   "access to the sea": [
      "The ability to bring goods to and from a port that is able to harbor sea faring vessels."
   ],

   "accident": [
      "An unexpected, unfortunate mishap, failure or loss with the potential for harming human life, property or the environment.",
      "An event that happens suddenly or by chance without an apparent cause."
   ],

   "accumulator": [
      "A rechargeable device for storing electrical energy in the form of chemical energy, consisting of one or more separate secondary cells.\\n(Source: CED)"
   ],

   "acidification": [
      "Addition of an acid to a solution until the pH falls below 7."
   ],

   "acidity": [
      "The state of being acid that is of being capable of transferring a hydrogen ion in solution."
   ],

   "acidity degree": [
      "The amount of acid present in a solution, often expressed in terms of pH."
   ],

   "acid rain": [
      "Rain having a pH less than 5.6."
   ],

   "acid": [
      "A compound capable of transferring a hydrogen ion in solution.",
      "Being harsh or corrosive in tone.",
      "Having an acid, sharp or tangy taste.",
      "A powerful hallucinogenic drug manufactured from lysergic acid.",
      "Having a pH less than 7, or being sour, or having the strength to neutralize  alkalis, or turning a litmus paper red."
   ],

...

}

The dictionary’s keys are strings that represent the words or phrases, while its values are arrays, where each element in that array is a definition for that word or phrase. To look up the meaning(s) of the word “acid,” you’d use the statement dictionary["acid"].
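Here's a minimal sketch of loading and querying a dictionary shaped like this one. It parses an inline string, SAMPLE_JSON, which is a stand-in I've assumed for illustration so that the example runs without the real file:

```python
import json

# A tiny stand-in for the dictionary file (assumption: the real file is
# far larger, but has the same "term": [definitions] shape).
SAMPLE_JSON = """
{
   "acid rain": [
      "Rain having a pH less than 5.6."
   ],
   "accident": [
      "An unexpected, unfortunate mishap, failure or loss with the potential for harming human life, property or the environment.",
      "An event that happens suddenly or by chance without an apparent cause."
   ]
}
"""

dictionary = json.loads(SAMPLE_JSON)

# Each value is a list, so a term can have more than one definition.
for number, definition in enumerate(dictionary["accident"], start=1):
  print(f"{number}. {definition}")

# Keys are case-sensitive: "Acid Rain" is not the same key as "acid rain".
# get() returns a fallback value instead of raising a KeyError.
missing = dictionary.get("Acid Rain", [])
print(missing)
```

With a file on disk, json.load() on the open file object would take the place of json.loads().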

Dictionary keys are case-sensitive. For most words and phrases in the dictionary, that’s not a problem. Any entry in the dictionary that isn’t for a proper noun (the name of a person, place, organization, or the title of a work) has a completely lowercase key. It’s easy to massage a search term into lowercase with Python’s lower() method for strings.

Any entry in the dictionary that is for a proper noun is titlecased: the first letter in each word is uppercase, and the remaining letters are lowercase. Once again, a search term can be massaged into titlecase in Python; that’s what the title() method for strings is for.

When looking up an entry in the dictionary, my application tries a reasonable set of variations on the search term:

As entered by the user (stripped of leading and trailing spaces, and sanitized)
Converted to lowercase with lower()
Converted to titlecase with title()
Converted to uppercase with upper()

For example, for the search term “FLorida” (the “FL” capitalization is an intentional typo), the program tries querying the dictionary using dictionary["FLorida"], dictionary["florida"], dictionary["Florida"], and dictionary["FLORIDA"].
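The sequence of attempts described above could be sketched like this. It isn't the project's actual code: the lookup() function and the sample entries (including the “Florida” definition) are made up for illustration, and the sanitizing step is left out because it isn't described here.

```python
def lookup(dictionary, search_term):
  """Returns the definitions for the first variation of search_term
  found in dictionary, or an empty list if none of them is a key.
  """
  term = search_term.strip()
  # Try the term as entered, then lowercase, titlecase, and uppercase.
  for candidate in (term, term.lower(), term.title(), term.upper()):
    if candidate in dictionary:
      return dictionary[candidate]
  return []

# Hypothetical sample data: one proper noun and one ordinary phrase.
sample = {
  "Florida": ["A state in the southeastern United States."],
  "acid rain": ["Rain having a pH less than 5.6."],
}

print(lookup(sample, "  FLorida "))  # found via title()
print(lookup(sample, "ACID RAIN"))   # found via lower()
print(lookup(sample, "goat"))        # not found: []
```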

Looking up words or phrases made out of initials is a little more challenging because people spell them differently:

The Latin term for “after noon” — post meridiem — is spelled as pm, p.m., PM, and P.M.
Some people write the short form for “United States of America” as USA, while others write it as U.S.A.

To solve this problem, I wrote two short methods:

word_to_initialism(), which converts a word into an initialism
initialism_to_acronym(), which turns an initialism into an acronym

Here’s the code for both…

import re

def word_to_initialism(word):
  """Turns every letter in a given word to an uppercase letter followed by a period.
  
  For example, it turns “goat” into “G.O.A.T.”.
  """
  return re.sub('([a-zA-Z])', '\\1.', word).upper()

def initialism_to_acronym(initialism):
  """Removes the periods from an initialism, turning it into an acronym.

  For example, it turns “N.A.S.A.” into “NASA”.
  """
  return re.sub(r'\.', '', initialism)

…and here are these methods in action:

# Outputs “G.O.A.T.”
print(f"word_to_initialism(): {word_to_initialism('goat')}")

# Outputs “RADAR”
print(f"initialism_to_acronym(): {initialism_to_acronym('R.A.D.A.R.')}")
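As a rough sketch of how the two methods might feed the lookup, the hypothetical spelling_variants() function below generates the common spellings of a term made of initials. The function name and the order of the candidates are my own assumptions, not part of the project.

```python
import re

def word_to_initialism(word):
  """Turns “goat” into “G.O.A.T.”."""
  return re.sub('([a-zA-Z])', '\\1.', word).upper()

def initialism_to_acronym(initialism):
  """Turns “N.A.S.A.” into “NASA”."""
  return re.sub(r'\.', '', initialism)

def spelling_variants(term):
  """Returns candidate spellings of a term made of initials,
  in order of preference and without duplicates.
  """
  acronym = initialism_to_acronym(term)     # "p.m." becomes "pm"
  initialism = word_to_initialism(acronym)  # "pm" becomes "P.M."
  candidates = [
    term,
    acronym.lower(),
    acronym.upper(),
    initialism.lower(),
    initialism,
  ]
  # dict.fromkeys() drops duplicates while keeping the original order.
  return list(dict.fromkeys(candidates))

print(spelling_variants("p.m."))  # ['p.m.', 'pm', 'PM', 'P.M.']
print(spelling_variants("USA"))   # ['USA', 'usa', 'u.s.a.', 'U.S.A.']
```

This only makes sense for short terms built from initials. Fed a phrase like “acid rain”, word_to_initialism() would put a period after every letter in it.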

Both methods use regular expressions. Here’s the regular expression statement that drives word_to_initialism():

re.sub('([a-zA-Z])', '\\1.', word)

re.sub() is Python’s regular expression substitution method, and it takes three required arguments:

The pattern to look for, which in this case is ([a-zA-Z]). The character class [a-zA-Z] means “any single unaccented letter in the given string, whether lowercase or uppercase”. Putting it in parentheses makes it a capturing group.
The replacement, which in this case is the string '\\1.'. The \\1 specifies that the replacement will start with the contents of the first group, which is the detected letter. The backslash is doubled because it sits inside an ordinary Python string, so the regular expression engine sees \1. It’s followed by a literal period, which means that a period will be added after every letter in the given string.
The given string, in which the replacement is supposed to take place.
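To see the group and the backreference at work, here's a short experiment. The variable names are just for illustration. It also shows two equivalent spellings: a raw string, which avoids doubling the backslash, and \g<0>, which refers to the whole match and so needs no parentheses in the pattern.

```python
import re

# Every letter is captured as group 1 and written back with a period.
basic = re.sub('([a-zA-Z])', '\\1.', 'goat')
print(basic)           # g.o.a.t.

# Digits and punctuation don't match [a-zA-Z], so they pass through.
mixed = re.sub('([a-zA-Z])', '\\1.', 'r2-d2')
print(mixed)           # r.2-d.2

# Raw strings: the same pattern and replacement, with a single backslash.
raw = re.sub(r'([a-zA-Z])', r'\1.', 'goat')
print(raw)             # g.o.a.t.

# \g<0> stands for the entire match, so no capturing group is needed.
whole_match = re.sub('[a-zA-Z]', r'\g<0>.', 'goat')
print(whole_match)     # g.o.a.t.
```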

The regular expression behind initialism_to_acronym() is even simpler:

re.sub(r'\.', '', initialism)

In this method, re.sub() is given these arguments:

The pattern to look for, which in this case is \., which means “a literal period character”.
The replacement, which is the empty string.
The given string, in which the replacement is supposed to take place.
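The backslash in that pattern does real work. Here's a quick comparison, with variable names chosen just for illustration, showing what happens without it, plus a regex-free alternative using the str.replace() method:

```python
import re

# Unescaped, "." matches any character (except a newline),
# so every character gets replaced and nothing is left.
unescaped = re.sub('.', '', 'N.A.S.A.')
print(repr(unescaped))  # ''

# Escaped, it matches only literal periods.
escaped = re.sub(r'\.', '', 'N.A.S.A.')
print(escaped)          # NASA

# For a fixed string like this, str.replace() gives the same result
# without a regular expression.
replaced = 'N.A.S.A.'.replace('.', '')
print(replaced)         # NASA
```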