“It is interesting to note that three distinct approaches to this problem have emerged over the past decades: A Bayesian, a Frequentist, and a Deep Learning based.”
I wrote this in last week’s post and was convinced I would need to use at least one of these approaches. Now, four days later, I have indeed fixed about 80 % of the OCR errors (yesterday, to be exact). But I used neither a Frequentist nor a Bayesian approach, nor did I train any model. Rather, my own brain now laughs at how ridiculously I was overthinking the problem.
What happened? Well, first, I know that I tend to overthink stuff. Whenever I have a problem, I tend to search the problem space the way an archaeologist would search a site. Additionally, I tend to think “No, it must be more complicated than that”, as if every solution required an advanced degree in quantum physics. And my last OCR post definitely fell into that category. I mean, look at the OCR errors I described there: almost all of them followed patterns.
Does this ring a bell? Patterns? No? Well, it didn’t for me, and that’s why I feel stupid now. But it should have immediately directed my attention towards what a colleague of mine, who’s much more experienced with OCR, told me two days ago: regular expressions. He told me that in their department, they get rid of most OCR errors just by dropping a few RegExps into the mix. And that’s it.
Following this, I randomly opened a few of the files and looked for obvious errors (where innocent Latin letters had been replaced with monstrous ASCII art). And, sure enough, there were many errors that repeated themselves thousands of times. So I spent the whole day looking through the files for obvious errors, ran a search across all files to see if they consistently represented the same error (some errors had more than one solution, e.g. either “P” or “R”, or “n” or “u”), and added them to a list of errors to fix. Then I simply ran that function over all files. Looking at the files afterwards, I saw that most errors were actually fixed.
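Checking whether a suspicious character sequence really stood for the same thing everywhere boiled down to counting its occurrences across all files. Here is a minimal sketch of such a check; the directory name and the candidate strings are placeholders, purely for illustration:

from pathlib import Path
from collections import Counter

# Placeholder directory and candidate error strings
DATA_DIR = Path("ocr_texts")
CANDIDATES = ["~Ir", "Jtfr", "l\\I"]

counts = Counter()
for path in sorted(DATA_DIR.glob("*.txt")):
    text = path.read_text(encoding="utf-8", errors="ignore")
    for candidate in CANDIDATES:
        counts[candidate] += text.count(candidate)

# Frequent, consistent hits are good candidates for a plain replace
for candidate, n in counts.most_common():
    print(f"{candidate!r}: {n} occurrences")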
So what did the function look like? Lo and behold:
import re


def fix_line(line):
    """
    Fixes very common OCR errors.
    """
    # My, oh my
    line = line.replace("~Ir", "Mr")
    line = line.replace("¥r", "Mr")
    line = line.replace("1\\!r", "Mr")
    line = line.replace("1\\!R", "MR")
    line = line.replace("Jtfr", "Mr")
    line = line.replace("It1r", "Mr")
    line = line.replace("l'¥Ir", "Mr")
    line = line.replace("llfr.", "Mr.")
    line = line.replace("~lr.", "Mr.")
    line = line.replace("llfrs.", "Mrs.")
    line = line.replace("Jtfa", "Ma")
    line = line.replace("JtfA", "MA")
    # ASCII-Art
    line = line.replace("l\\I", "M")
    line = line.replace("1\\I", "M")
    line = line.replace("l'tf", "M")
    line = line.replace("!\\'", "N")
    line = line.replace("p1·", "pr")
    line = line.replace("p1•", "pr")
    line = line.replace("'!'", "T")
    line = line.replace("()'", "g")
    line = line.replace("(l'", "g")
    line = line.replace("!IT", "gr")
    line = line.replace(" \\Y", " W")
    line = line.replace("n<>\"", "ng")
    line = line.replace("'\\'\\\"\"", "w")  # '\'\""
    line = line.replace("'\\\"\"", "v")
    # "(l" after a letter is a misread "d" (e.g. "woul(l " -> "would ")
    line = re.sub(r"([a-zA-Z])\(l ", r"\1d ", line)
    # Dirt on the scan
    line = line.replace("·", "")
    line = line.replace("•", " ")
    # Stray punctuation squeezed between two letters is scanner dirt as well
    line = re.sub(r"([a-zA-Z])[\.,;:]([a-zA-Z])", r"\1\2", line)
    return line
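Running it over the whole corpus is then just a loop over the files. Roughly like this, with the directory names again being placeholders (and writing the fixed copies to a separate folder rather than overwriting the originals):

from pathlib import Path

DATA_DIR = Path("ocr_texts")  # placeholder for wherever the OCR'd files live
OUT_DIR = Path("ocr_fixed")   # fixed copies go here
OUT_DIR.mkdir(exist_ok=True)

for path in sorted(DATA_DIR.glob("*.txt")):
    lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
    fixed = "\n".join(fix_line(line) for line in lines)
    (OUT_DIR / path.name).write_text(fixed, encoding="utf-8")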
As you can see: Most errors didn’t even require the use of a regular expression because they were so predictable. And, thinking about it, this absolutely makes sense: The OCR engine runs on a computer, and being deterministic is one of those fundamental properties of computers. That means: if the engine encounters a “W” and replaces it with some odd character mix, it is almost certain that the next occurrence of “W” will be replaced with the same odd character mix. So many very obvious errors could be mitigated simply by search and replace that it is almost comical.
However, while that made up a large share of the OCR errors, there are more complicated ones. Interestingly, these follow only slightly more complicated patterns. Have a look at, e.g., this comment I wrote down: “<r can be seen as a "g" almost every time, once we have a "c" (as<rertaining) and also once an "e"”. Or this one: “I> can be either D or P or B/b, sometimes O”.
So, how does one fix these? Well, easy (almost). One of the initial ideas I had (even before the last article) was that most OCR errors can be conceptualized as spelling mistakes. So, using a spellchecker, it might be easy to fix them, right? And that’s how I want to approach the next iteration of OCR correction: have a look at many common mistakes and have a spellchecker predict which word is the most likely correct one. It uses Levenshtein distances to find candidates and additionally takes into account the relative frequency of the terms in my dataset, so in most cases it should be right. I’ll keep a log of which word got replaced by which other one, and iteratively improve on potential new errors. But afterwards, I should have a more or less perfectly cleaned dataset! Hooray!
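To make that concrete, here is a minimal sketch of the idea. The actual spellchecker I end up using may work differently; "vocab" is assumed to be a word-frequency Counter built from my own files, and "correct" and "levenshtein" are just hypothetical helpers for illustration:

from collections import Counter

def levenshtein(a, b):
    # Classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def correct(word, vocab, log, max_distance=2):
    # Keep in-vocabulary words as they are
    if word in vocab:
        return word
    # Pick the closest known word; ties are broken by corpus frequency
    distance, neg_freq, best = min(
        (levenshtein(word, w), -vocab[w], w) for w in vocab
    )
    if distance > max_distance:
        return word
    log[word] = best  # remember which word got replaced by which
    return best

# vocab would be built from the dataset itself, e.g.:
# vocab = Counter(token for line in all_lines for token in line.split())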
Some other errors are still in the pipeline, though. The re-segmentation issue is still unresolved, for instance, and I have even found out that sometimes the problem is not too much whitespace but too little whitespace (since sometimes wordsarewrittenlikethis). And there’s still the problem of page headers (but there’s a good indicator for what these are, so stay tuned for how I’ll fix those).
As you can see: Sometimes the most straightforward way of solving problems is the best one to start with. You can always go crazy, but if you are like me and tend to overthink things, it’s almost always a good idea to just ask other people how they would approach the problem. Ask questions, folks!
Until next time!