Tuesday 27 November 2018

Coordinates (4): a way if you hate regex

If we don't want to use regular expressions, we need a different approach to the 'variety of coordinate formats' problem.

This time we will:
  • 'normalize' the input into a friendlier format (e.g. replace all kinds of separators (delimiters) with a space)
  • parse the result sequentially: what to do next depends on what the previous step returned. For example, if splitting the intermediate result into a list gives 3 items, we assume that the first is degrees, the second minutes and the third seconds, and then validate them to check that they make sense
Let's create a function that converts coordinates into DD and takes 2 arguments:
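The 'split and count the items' idea from the second point can be sketched like this. It is only an illustration under a few assumptions: the helper name parts_to_dd is made up for this example, the input is already normalized to space-separated numbers, and sign and hemisphere are ignored for now.

```python
def parts_to_dd(coord_norm):
    """Convert space-separated 'D M S', 'D M' or 'D' text to decimal degrees.

    Hypothetical helper for illustration only; returns None when the
    parts cannot be interpreted as a non-negative coordinate.
    """
    parts = coord_norm.split()
    if not 1 <= len(parts) <= 3:
        return None
    try:
        numbers = [float(p) for p in parts]
    except ValueError:
        return None
    # Pad with zeros so we always have degrees, minutes and seconds
    d, m, s = numbers + [0.0] * (3 - len(numbers))
    # Assumption: degrees are non-negative, minutes and seconds are in [0, 60)
    if not (d >= 0 and 0 <= m < 60 and 0 <= s < 60):
        return None
    return d + m / 60 + s / 3600


print(parts_to_dd('52 30 0'))   # 52.5
print(parts_to_dd('10 75'))     # None, because 75 minutes makes no sense
```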


def coord2dd(coord_input, c_type):

coord_input: input coordinate to be converted (it might already be in DD format!)
c_type: coordinate type (latitude or longitude), used to make sure that the coordinate is within the appropriate range

Definition of c_type constants:

C_LAT = 'C_LAT'
C_LON = 'C_LON'

First we 'normalize input':


coord_norm = coord_input.strip()
coord_norm = coord_norm.upper()
coord_norm = coord_norm.replace(',', '.')  


This removes leading and trailing spaces, converts all characters to upper case (to make sure that the hemisphere letter is N and not n, for example) and replaces the decimal comma with a decimal dot (some countries use a comma instead of a dot).
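To see the effect on a sample value, the three statements can be wrapped in a small function. The name normalize is only an assumption for this example; in the post the statements live inside coord2dd.

```python
def normalize(coord_input):
    # Hypothetical wrapper around the three normalization statements
    coord_norm = coord_input.strip()           # drop leading/trailing spaces
    coord_norm = coord_norm.upper()            # 'n' -> 'N'
    coord_norm = coord_norm.replace(',', '.')  # decimal comma -> decimal dot
    return coord_norm


print(normalize('  52,25n '))   # 52.25N
```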

What we are going to do now is check, in sequence, whether the input is in one of the following formats:
  • DD
  • DMS delimited (delimiters hyphen, space, decimal sign etc.)
  • DM delimited
  • DD with hemisphere letter
So, let's define a variable dd which will hold the result:

dd = None 

And check if input is in DD format:

try:
    dd = float(coord_norm)
except ValueError:
    # Code to be executed in case input is not in DD format
    pass  # placeholder so the snippet is valid Python

Note that validation of whether dd (the derived DD value) is within range will be added later.
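One possible shape for that range check is sketched below. The helper names is_within_range and try_dd and the DD_LIMITS dictionary are assumptions made for this example, not necessarily what the finished function will look like; the limits themselves are the usual -90..90 for latitude and -180..180 for longitude.

```python
C_LAT = 'C_LAT'
C_LON = 'C_LON'

# Maximum absolute value in decimal degrees for each coordinate type
DD_LIMITS = {C_LAT: 90.0, C_LON: 180.0}


def is_within_range(dd, c_type):
    """Hypothetical helper: True if dd is a valid value for c_type."""
    limit = DD_LIMITS[c_type]
    return -limit <= dd <= limit


def try_dd(coord_norm, c_type):
    """Return the coordinate as float if it is plain DD and in range, else None."""
    try:
        dd = float(coord_norm)
    except ValueError:
        return None  # not plain DD, the other formats have to be tried
    return dd if is_within_range(dd, c_type) else None


print(try_dd('152.25', C_LAT))   # None, latitude cannot exceed 90
print(try_dd('152.25', C_LON))   # 152.25
```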

If the input is not in DD format, the code after the except statement will be executed.
Our assumption is that the input is then in delimited DMS or DM format, or in DD with a hemisphere letter as prefix or suffix.
Because we don't know which delimiter is used in the input, a good idea is to convert all delimiters into the same one, the space.


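import re

for sep in S_ALL:
    # re.escape keeps the separator literal even if it contains regex metacharacters
    coord_norm = re.sub(re.escape(sep), ' ', coord_norm)

# Collapse runs of whitespace into a single space
coord_norm = re.sub(r'\s+', ' ', coord_norm)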

S_ALL is the list of separators. The separators are defined as:

S_SPACE = ' '
S_HYPHEN = '-'
S_DEG_WORD = 'DEG'
S_DEG_LETTER = 'D'
S_MIN_WORD = 'MIN'
S_MIN_LETTER = 'M'
S_SEC_WORD = 'SEC'

S_ALL = [S_SPACE, S_HYPHEN, S_DEG_WORD, S_DEG_LETTER, S_MIN_WORD, S_MIN_LETTER, S_SEC_WORD]

As you can see, we can add other separators (delimiters) by extending the list S_ALL, and we do not have to change the function to be able to parse more formats!
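Since every separator here is a literal string, the same step can also be written without the re module at all, which fits the spirit of this post. The following is a sketch, not the code used in the function: the name unify_separators is made up, and S_ALL is repeated with literal values so the example runs on its own.

```python
# Same separators as S_ALL above; word forms come before their one-letter
# forms, otherwise 'DEG' would be turned into ' EG' by the 'D' replacement
S_ALL = [' ', '-', 'DEG', 'D', 'MIN', 'M', 'SEC']


def unify_separators(coord_norm, separators=S_ALL):
    """Hypothetical helper: replace every separator with a single space."""
    for sep in separators:
        coord_norm = coord_norm.replace(sep, ' ')
    # split() without arguments splits on runs of whitespace, so joining
    # the pieces collapses repeated spaces (and trims both ends)
    return ' '.join(coord_norm.split())


print(unify_separators('52DEG30MIN15SEC N'))   # 52 30 15 N
```

Unlike the re.sub version, this variant also trims leading and trailing spaces, which is harmless here because the input was stripped earlier.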

Now we check if the first or last character of the 'normalized' input is a hemisphere letter. To make this, and later work, a little bit easier, we need the following constants:

H_LAT = ['N', 'S']
H_LON = ['E', 'W']
H_NEGATIVE = ['S', 'W']
H_ALL = ['N', 'S', 'E', 'W']
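With these constants the check itself could look roughly as follows. This is a sketch under assumptions: the helper names split_hemisphere and apply_hemisphere are invented for this example, and input with a letter at both ends is not treated specially.

```python
H_LAT = ['N', 'S']
H_LON = ['E', 'W']
H_NEGATIVE = ['S', 'W']
H_ALL = ['N', 'S', 'E', 'W']


def split_hemisphere(coord_norm):
    """Hypothetical helper: return (hemisphere, rest of the input).

    hemisphere is None when neither the first nor the last character
    is a hemisphere letter.
    """
    if coord_norm and coord_norm[0] in H_ALL:
        return coord_norm[0], coord_norm[1:].strip()
    if coord_norm and coord_norm[-1] in H_ALL:
        return coord_norm[-1], coord_norm[:-1].strip()
    return None, coord_norm


def apply_hemisphere(dd, hemisphere):
    # Southern and western hemispheres map to negative decimal degrees
    return -dd if hemisphere in H_NEGATIVE else dd


print(split_hemisphere('52 30 15 N'))   # ('N', '52 30 15')
print(apply_hemisphere(21.25, 'W'))     # -21.25
```

H_LAT and H_LON are not used in this sketch; they would let us check that the letter agrees with c_type, for example that a latitude does not come with E or W.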



Needless to say, writing regular expressions that cover all potential coordinate formats is not an easy task.
However, you can write:
  • a single regex for each format
  • a few more complex ones, or even one 'super-complex' regex that takes into account all possible cases
  • a function or class that builds the regular expression for us on the basis of all possible (valid) combinations of degrees, minutes, seconds and separators
