January — 2011 — Sam Tuke's blog

Filling in the bank’s blanks with regular expressions

I’ve recently had to do a lot of work on a set of data relating to my bank account transactions, which required a great deal of text manipulation and several regular expressions. My bank doesn’t believe that giving its customers access to digital copies of their account and transaction history is important, and it only makes available images (stored in PDFs) of past statements which have been posted. Because of this, I had to use Optical Character Recognition (OCR) software to extract text from the images, fix by hand all the errors in the resulting output, and then manually structure the data into columns and rows using regular expressions (as my OCR software didn’t detect the columns and rows itself). To make matters worse, each image provided by my bank had a large text watermark written diagonally across the page, stating “duplicate”. All contents of the spreadsheet which came into contact with this text were unreadable during OCR, and had to be fixed by hand.
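
To give a rough idea of what structuring OCR output with a regular expression can look like, here is a minimal Python sketch. It is not the expressions I actually used: the line layout (a DD/MM/YYYY date, then a description, then an amount) and the names `ROW`, `split_row` and `to_rows` are assumptions made purely for illustration, and a real statement would need its own pattern.

```python
import re

# Assumed (hypothetical) line layout: "DD/MM/YYYY  description  amount".
# A real statement will differ, so the pattern would need adjusting.
ROW = re.compile(r"^(\d{2}/\d{2}/\d{4})\s+(.+?)\s+(-?\d[\d,]*\.\d{2})$")


def split_row(line):
    """Return (date, description, amount) or None if the line does not match."""
    match = ROW.match(line.strip())
    if match is None:
        return None
    date, description, amount = match.groups()
    # Drop thousands separators so the amount is easy to parse later.
    return date, description, amount.replace(",", "")


def to_rows(text):
    """Split OCR text into rows, skipping lines that do not look like transactions."""
    rows = []
    for line in text.splitlines():
        parsed = split_row(line)
        if parsed is not None:
            rows.append(parsed)
    return rows
```

Lines that don’t match (page headers, fragments of the watermark and so on) are simply skipped, so they can be collected and corrected by hand afterwards.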