
From Russia with OCR



The Internet Archive is a treasure for enthusiasts of old computers.  Archivists have been uploading old computer newsletters there for years, and they provide a vivid view of what it was like to be an Atari enthusiast when Atari was still a going concern.  Thanks to the archive, I've been able to read such excellent old newsletters as the MACE Journal, SLCC Journal, Current Notes, and dozens of others.


The archive itself, though, has some quirky behaviors.  When you upload content, it examines it and produces derived files.  In the past these included DjVu versions of the content (a feature I sorely miss) and, of interest to me, OCR text of the content.


In the past, the OCRs were produced using ABBYY FineReader (mostly version 11).  Recently the archive has switched to using Tesseract.  My initial impression of Tesseract is generally favorable, but like all OCR software, it can fail in interesting ways.


Take the Portland Atari Club newsletters from 1984/1985 that @Allan recently added to the archive.  The metadata for the uploads apparently didn't specify the language, so Tesseract falls back on some heuristic to detect it.  For some reason, it decided that the November 1984 and January 1985 issues are in a combination of Latin and Cyrillic script, so the OCR results are much worse than usual (and the usual results for newsletters are not great to begin with).  For instance, PAC (for Portland Atari Club) was often converted to three Cyrillic letters that look very similar.
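A quick way to spot OCR output that went wrong like this is to measure how many of its letters are Cyrillic. The sketch below is only an illustration, not what the archive or Tesseract does: the function names and the look-alike table are my own assumptions, and the table is deliberately partial. It also shows how a Cyrillic "PAC" could be folded back to Latin.

```python
import unicodedata

# Hypothetical, partial table of Cyrillic letters that resemble Latin ones.
# Written as escapes so the Cyrillic code points are unambiguous.
LOOKALIKES = str.maketrans({
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041a": "K",
    "\u041c": "M", "\u041d": "H", "\u041e": "O", "\u0420": "P",
    "\u0421": "C", "\u0422": "T", "\u0425": "X",
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0445": "x",
})


def cyrillic_ratio(text: str) -> float:
    """Return the fraction of alphabetic characters that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith("CYRILLIC")
    )
    return cyrillic / len(letters)


def fold_lookalikes(text: str) -> str:
    """Replace look-alike Cyrillic letters with their Latin twins."""
    return text.translate(LOOKALIKES)
```

An English newsletter should score at or near zero with `cyrillic_ratio`, so anything noticeably higher is worth a closer look. Folding look-alikes only repairs the letters in the table; words that were misread into other Cyrillic letters still need fixing by hand.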


The reason I noticed this at all is that I keep an archive of those OCRs, which I search when I want to find old reviews or Atari news.  Let's say I'm curious about how Seven Cities of Gold was received when it was released.  Using "grep" on the OCRs I have, I find there was a review of it in Current Notes, October 1984, and in the MACE Journal, March 1985.  I also see there was a review in v.3 n.8 of the JACG Newsletter because it's listed in one of their indexes, but that issue isn't archived anywhere to my knowledge.  I also spot a couple of capsule reviews (e.g. PAC Newsletter, July 1987) that I would have missed if I had kept only the tables of contents for all these old newsletters.
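For anyone without grep handy, the same kind of search can be sketched in Python. This is an assumption-laden illustration rather than my actual setup: it supposes the OCRs are stored as UTF-8 `.txt` files under one directory, and the function names are made up for the example. It ignores case and tolerates the uneven spacing OCR tends to produce.

```python
import re
from pathlib import Path


def phrase_pattern(phrase: str) -> re.Pattern:
    """Build a case-insensitive pattern that allows any whitespace between words."""
    words = [re.escape(word) for word in phrase.split()]
    return re.compile(r"\s+".join(words), re.IGNORECASE)


def search_text(text: str, phrase: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for each line containing the phrase."""
    pattern = phrase_pattern(phrase)
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if pattern.search(line)
    ]


def search_tree(root: str, phrase: str) -> list[tuple[str, int, str]]:
    """Search every .txt file under root (assumed layout) for the phrase."""
    hits = []
    for path in sorted(Path(root).rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        hits.extend((str(path), n, line) for n, line in search_text(text, phrase))
    return hits
```

Calling `search_tree("ocr", "Seven Cities of Gold")` would list each matching file, line number, and line, where `ocr` is a placeholder directory name. Because the search works line by line, a title that wraps across two lines is missed, which is one more reason to clean up the text first.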


So when new Atari newsletters show up, I download the OCRs and clean them up (mostly fixing errors and de-hyphenating) so that I can search them in the future.  Someday I'll make these available, but for now I've attached the Portland Atari Club ones I recently did, since the OCRs on the archive aren't very good.
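De-hyphenating is the part of that cleanup that is easiest to automate. Here is a minimal sketch of one way to do it, not necessarily how I do it: it assumes a word was split whenever a letter is followed by a hyphen at the end of a line and the next line starts with a lowercase letter. The function name is invented for the example.

```python
import re

# A letter, a hyphen at end of line, then a lowercase letter on the next line.
_SPLIT_WORD = re.compile(r"([A-Za-z])-[ \t]*\n[ \t]*([a-z])")


def dehyphenate(text: str) -> str:
    """Rejoin words that were hyphenated across a line break."""
    return _SPLIT_WORD.sub(r"\1\2", text)
```

The rule leaves "8-" followed by "bit" on the next line alone because a digit precedes the hyphen, but it will wrongly merge a genuine compound such as "well-known" if the line happened to break at its hyphen, so the result still deserves a read-through.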




Recommended Comments

Interesting. I will try to make sure I mark it as English from now on in hopes of avoiding this. From the title I thought you were writing about finding Russian Atari 8-bit stuff on the Archive.
