Wednesday, 27 March 2019

How to 'convert' a table from PDF to CSV (example of extracting points)

Let's imagine that you have PDF files with some data and you need to extract part of it for further processing: e.g. a file contains a table of points with attributes such as name, and your goal is to get a points layer from this file. If the tables are formatted nicely enough that copying and pasting them into a text file gives you something that still looks like a 'table', I have good news for you: there is no need to use extra libraries or tools to extract the desired data. You could extract the data manually, for example with a spreadsheet (which is something we are not going to do), or use websites that convert PDF into other formats, etc. You can also take advantage of the fact that the copied table is 'nicely' formatted and write a parser that extracts the desired data. This is worth doing if the data has to be processed repeatedly, in cycles, for example parts of an AIP.


We have a 'table' in a text file:
 ABCDE       010101.10N 0010101.10W ABC DEF 100 500
       ABCDE       010101.10N      0010101.10W ABC DEF  100 500
ABCDE 010101.10N 0010101.10W ABCDEF 100 500
ABCDE 010101.10N 0010101.10W 


It is easy to spot the following pattern:
[variable_blanks][point_name][variable_blanks][DDMMSS.ssH][variable_blanks][DDDMMSS.ssH][variable_blanks][other_stuff]

We are going to use a regular expression to parse each line of the file and write the result into a csv file. Moreover, we are going to convert the data from DMS into DD format and store it in the result file. We know the format, so let's use the 'predetermined' angle. Let's start with importing the required modules and building the regular expression:


import re
import csv
from core_predermined_angle import PredeterminedCoordinates2DD


regex_points = re.compile(r'''(?P<ident>[A-Z0-9]{5})
                              (\s+)
                              (?P<lat>\d{6}(\.\d+)?[NS])
                              (\s+)
                              (?P<lon>\d{7}(\.\d+)?[EW])
                              (.*?)
                            ''', re.VERBOSE)
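
A quick way to check the pattern is to run it against the sample lines from the 'table'. The sketch below is an illustration only: extract_point and sample_lines are names made up for this example, the fractional part of the latitude is spelled \.\d+ (a dot followed by digits), and the whitespace separators are left uncaptured.

```python
import re

# Pattern for: point name, latitude DDMMSS.ssH, longitude DDDMMSS.ssH.
# With re.VERBOSE the layout whitespace inside the pattern is ignored.
regex_points = re.compile(r'''(?P<ident>[A-Z0-9]{5})
                              \s+
                              (?P<lat>\d{6}(\.\d+)?[NS])
                              \s+
                              (?P<lon>\d{7}(\.\d+)?[EW])
                           ''', re.VERBOSE)

sample_lines = [
    ' ABCDE       010101.10N 0010101.10W ABC DEF 100 500',
    '       ABCDE       010101.10N      0010101.10W ABC DEF  100 500',
    'ABCDE 010101.10N 0010101.10W ABCDEF 100 500',
    'ABCDE 010101.10N 0010101.10W ',
    'some header line without coordinates',
]


def extract_point(line):
    """Return (ident, lat, lon) for a matching line, otherwise None."""
    # Leading blanks are removed first, because match() anchors at position 0.
    m = regex_points.match(line.lstrip())
    if m is None:
        return None
    return m.group('ident'), m.group('lat'), m.group('lon')


for line in sample_lines:
    print(extract_point(line))
```

All four sample rows should give the same triple, and the last line, which has no coordinates, should give None.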

Now we are ready to define a function that will parse the text file with data and write the result to a csv file. The function takes the following arguments: input file path, output file path, ICAO code of the country the data comes from, and waypoint type.
As I said, we are going to convert DMS into DD format, and this will be done using an instance of the PredeterminedCoordinates2DD class from the core_predermined_angle module. Before we parse a line of the file, a bit of 'normalization' is required: removing leading blanks and removing newline characters.


def parse_points_txt2csv(input_file, output_file, icao_ctry, wpt_type):

    dd_coord = PredeterminedCoordinates2DD()

    csv_fnames = ['ctry_icao_code', 'ident', 'wpt_type', 'lat_src', 'lon_src', 'lat_dd', 'lon_dd']
    with open(input_file, 'r') as in_file:
        with open(output_file, 'w', newline='') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=csv_fnames, delimiter=';')
            for line in in_file:
                line_mod = line.lstrip()
                line_mod = line_mod.strip('\n')

If the 'normalized' line matches the regex, extract the ident (point name), latitude and longitude:


# This fragment continues inside the `for line in in_file` loop.
match = regex_points.match(line_mod)  # match once and reuse the result
if match:
    ident = match.group('ident')
    lat_src = match.group('lat')
    lon_src = match.group('lon')

    dd_coord.dmsh2dd(lat_src, lon_src)

Note that the regular expression does not check whether the DMS value is valid (e.g. that minutes are less than 60), so we need to check validity during conversion to DD format. If it is valid, which means that both latitude and longitude have been converted successfully, write the result into the csv file:

if dd_coord.is_valid is True:
    writer.writerow({'ctry_icao_code': icao_ctry,
                     'ident': ident,
                     'wpt_type': wpt_type,
                     'lat_src': lat_src,
                     'lon_src': lon_src,
                     'lat_dd': str(dd_coord.lat_dd),
                     'lon_dd': str(dd_coord.lon_dd)})
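
The conversion class itself is not shown in this post. To illustrate what 'valid during conversion' can mean, here is a minimal sketch of a DMS to DD helper. Note that dmsh_to_dd is a hypothetical function written for this example, not the API of core_predermined_angle, and it assumes the input already has the DDMMSS.ssH or DDDMMSS.ssH shape matched by the regular expression.

```python
def dmsh_to_dd(dmsh):
    """Convert 'DDMMSS.ssH' or 'DDDMMSS.ssH' to signed decimal degrees.

    Hypothetical helper for illustration. Returns None when the value
    is not a valid angle (e.g. minutes or seconds of 60 or more).
    """
    if len(dmsh) < 2:
        return None
    body, hem = dmsh[:-1], dmsh[-1]

    # The hemisphere letter decides how many digits the degrees take.
    if hem in ('N', 'S'):
        deg_len, max_deg = 2, 90
    elif hem in ('E', 'W'):
        deg_len, max_deg = 3, 180
    else:
        return None

    try:
        deg = int(body[:deg_len])
        minutes = int(body[deg_len:deg_len + 2])
        seconds = float(body[deg_len + 2:])
    except ValueError:
        return None

    # This is the check the regular expression cannot do.
    if minutes >= 60 or seconds >= 60:
        return None

    dd = deg + minutes / 60 + seconds / 3600
    if dd > max_deg:
        return None

    # Southern and western hemispheres are negative in DD notation.
    return -dd if hem in ('S', 'W') else dd


print(dmsh_to_dd('010101.10N'))   # valid latitude
print(dmsh_to_dd('0010101.10W'))  # valid longitude, negative
print(dmsh_to_dd('016101.10N'))   # 61 minutes, so None
```

Returning None for an invalid value plays the same role here as the is_valid flag in the code above: the row is written only when both coordinates have been converted.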

And that is all, we are ready to extract data from 'text file' tables.
