Best OCR Program to use.

venom4728a · April 27, 2020

I am trying to create the first 8 A.N.A.L.O.G. magazine disks. I am not having good luck with copy and paste from the PDF or with the open-source OCR software I downloaded. I ended up with so many errors that it would have been easier to just type the lines in manually. Does anyone know of a better way or a good working OCR program to use?

Justin Payne · April 27, 2020

I don't know if this helps or just muddies the waters, but I did try to extract BASIC programs from old magazines. Of course, it was that HORRIBLE dot-matrix type, which OCR hates, BUT I heard that you could train the AI in Google's Tesseract. Unfortunately, the programs I found that used it were clunky and didn't give any better results. Now, it's been a year since I did this and I think Google has probably improved on this, BUT they're also there to make money, so I think you can use this service but only for a limited number of calls per day. What does that mean for you if you don't plan to write your own application? It means that if you use a program that somebody else wrote, you'll also have to know what the limitations of the Google service are, since that program will also have to adhere to them.
Here is a link to the Google OCR stuff.

https://cloud.google.com/vision/docs/ocr

I just did a real quick search and found this process. Give it a shot.
https://pdf.wondershare.com/pdf-software-comparison/google-docs-convert-scanned-pdf-text.html

Edited April 27, 2020 by Justin Payne

jamm · April 27, 2020

34 minutes ago, venom4728a said:

I am trying to create the first 8 A.N.A.L.O.G. magazine disks. I am not having good luck with copy and paste from the PDF or with the open-source OCR software I downloaded. I ended up with so many errors that it would have been easier to just type the lines in manually. Does anyone know of a better way or a good working OCR program to use?

The open-source programs I've tried need a great deal of training to make them at all useful. ABBYY FineReader is considered one of the best current commercial readers. I'm not sure what the limitations of their free trial are, but you may want to take a look there.

TGB1718 · April 27, 2020

I use Power PDF Standard. Attached is the output from "assembly language programmers guide.pdf".

You can see there are many things it couldn't handle, but most of the text looks OK. With a bit of fiddling with the parameters, it may produce better-quality output.

assembly language programmers guide1.docx

TGB1718 · April 27, 2020

10 minutes ago, jamm said:

ABBYY FineReader is considered one of the best current commercial readers

I have the 2016 version of this, currently waiting for it to complete the conversion, takes a few minutes.

Update: Done now

Here's the ABBYY FineReader output for comparison:

assembly language programmers guide1-1.docx

Edited April 27, 2020 by TGB1718
Update

Justin Payne · April 27, 2020

11 minutes ago, TGB1718 said:

I have the 2016 version of this, currently waiting for it to complete the conversion, takes a few minutes

;-)

venom4728a · April 27, 2020

I will try it from Google Drive. Thank you for your response.

I downloaded Tesseract from GitHub this morning, and after installing it I had one application named consol.exe. I opened it and all I ended up with was a DOS cmd window. It seemed useless.
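For what it's worth, Tesseract is a command-line program, so a bare console window is the expected interface rather than a sign of a broken install. Below is a minimal sketch of driving it from Python, assuming the `tesseract` executable is on the PATH and is version 4 or later (where the page segmentation option is spelled `--psm`). The file names and the helper names `build_tesseract_command` and `run_ocr` are made up for illustration.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", psm=6):
    """Return the argument list for one Tesseract run.

    Tesseract writes the recognized text to <output_base>.txt.
    """
    return [
        "tesseract",
        image_path,
        output_base,
        "-l", lang,
        "--psm", str(psm),  # 6 = assume a single uniform block of text
    ]


def run_ocr(image_path, output_base):
    """Run Tesseract if it is installed; return True when it exits cleanly."""
    if shutil.which("tesseract") is None:
        return False  # executable not found on PATH
    result = subprocess.run(build_tesseract_command(image_path, output_base))
    return result.returncode == 0
```

Calling `run_ocr("page1.png", "page1")` would leave the recognized text in `page1.txt`. Mode 6 seems a reasonable first guess for a program listing, but that choice is an assumption and not something tested on these scans.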

I read a few articles about how to improve the accuracy of OCR programs reading dot-matrix printer output, similar to how the listings appear in A.N.A.L.O.G. Magazine. It involves using photo filters to blur the dots together and using other tools to make the letters appear more solid. It also seemed like it would take more time to do than simply typing the files in. If it was just one magazine it might be worth trying. I have the first magazine nearly done, and am working on the second issue.
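To show what "making the letters more solid" can mean in practice, here is a rough sketch in plain Python. It uses a minimum filter (growing the dark pixels) rather than a true blur, and it works on a grayscale image held as a list of rows with 0 for black and 255 for white. The function name `fuse_dots` and the tiny test images are hypothetical; a real scan would normally go through an image editor or imaging library instead.

```python
def fuse_dots(pixels, radius=1):
    """Grow every dark pixel by `radius` so neighbouring dots merge.

    `pixels` is a list of rows of grayscale values (0 = black, 255 = white).
    Each output pixel is the darkest value found in its neighbourhood,
    which spreads the ink outward and closes small gaps between dots.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            darkest = pixels[y][x]
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    darkest = min(darkest, pixels[ny][nx])
            row.append(darkest)
        fused.append(row)
    return fused


W = 255
# A horizontal stroke printed as three separate dots.
dotted = [
    [W, W, W, W, W],
    [0, W, 0, W, 0],
    [W, W, W, W, W],
]
print(fuse_dots(dotted)[1])  # [0, 0, 0, 0, 0]: the stroke is now continuous
```

The trade-off is that a larger `radius` also fills in the small holes inside characters like 8 and B, so it would need tuning per scan.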

evilmoo · April 27, 2020

Microsoft OneNote will also do OCR. Not sure how well it will do with program listings though.

+Nezgar · April 27, 2020

I had previously inquired as well, after some challenges with Adobe Acrobat without much useful response either... After spending hours manually correcting 2 club newsletters in Adobe Acrobat, I was dismayed that when uploading to Internet Archive, it discarded my corrected OCR and replaced it with their own... I will not be continuing with that effort for the remaining 48.

Anyhow... On the other hand in the name of just getting things archived, maybe it's better to just skip the labour of fixing it and get them online, letting Internet Archive's automatic OCR prevail...

jamm · April 27, 2020

5 minutes ago, Nezgar said:

I had previously inquired as well, after some challenges with Adobe Acrobat without much useful response either... After spending hours manually correcting 2 club newsletters in Adobe Acrobat, I was dismayed that when uploading to Internet Archive, it discarded my corrected OCR and replaced it with their own... I will not be continuing with that effort for the remaining 48.

Anyhow... On the other hand in the name of just getting things archived, maybe it's better to just skip the labour of fixing it and let Internet Archive's automatic OCR prevail...

I know Archive.org will automatically OCR images (it looks like they use some version of Abbyy) - but will it actually modify the PDF you put up and discard embedded text? It looks like it creates new, separate text-only and PDF+OCR files for download.

+Allan · April 27, 2020

1 hour ago, venom4728a said:

I am trying to create the first 8 A.N.A.L.O.G. magazine disks. I am not having good luck with copy and paste from the PDF or with the open-source OCR software I downloaded. I ended up with so many errors that it would have been easier to just type the lines in manually. Does anyone know of a better way or a good working OCR program to use?

First of all, I want to thank you venom4728a for doing this. I did this for the first nine issues of Antic. It was a fun but frustrating project (but worth it).

My older experience with trying to OCR program listings was that it wasn't worth it. When I did sit down and do the early Antic issues, I just:

1. Tried to find the individual programs that were on the Net.

2. Typed in whichever ones I couldn't find.

Part of the problem with OCRing them is the source scans themselves. Most of them on the Net are of lower resolution so that makes it harder for the programs to OCR.

You would probably be better off re-scanning all the listings at a higher resolution and using those. Of course, you have to have the physical magazines and have to spend the time doing it.

Maybe the newer OCRing software could do a good job with high-res scans. I really don't know since I gave up on doing that years ago. Hopefully they can now.

It would be great if you put them together like I did with the Antic disks and used the earlier Analog menu program that was on the Analog disks.

If you want some help typing in some of the programs, let me know. (Private message me.) Be glad to help.

Allan

ivop · April 27, 2020

Another problem with OCR and Atari listings is the control characters. Remember ↰ (closest Unicode character) for clear screen?

jamm · April 27, 2020

We need an OCR library specifically trained for Atari source code. Blurry, dot-matrix, and ATASCII!

+Nezgar · April 27, 2020

2 hours ago, jamm said:

I know Archive.org will automatically OCR images (it looks like they use some version of Abbyy) - but will it actually modify the PDF you put up and discard embedded text? It looks like it creates new, separate text-only and PDF+OCR files for download.

Archive.org was indeed replacing my embedded text with its own, at least in March 2019. I'm testing again with a test upload right now. We'll see what it does...

Update: After a couple hours, I see additional metadata and separate OCR-text only files alongside, but my original uploaded file appears to have remained intact.

_The Doctor__ · April 27, 2020

in an effort to 'help' it hurts... this has happened for quite some time... maybe they'll give it a look now.

+DjayBee · April 27, 2020

A binary version of Tesseract can be found here at Uni Mannheim.

I am currently scanning an 8-pin Epson FX printout of "A Hitchhiker's Guide to the BIOS" for the Atari ST which I will OCR using Tesseract and later today upload to archive.org.

I don't have the time to QA all the OCR results, and instead keep the original scans as TIFF files.

~~But if anyone wants to proofread the result, feel free and give feedback on Tesseract's quality.~~

The OCRed result is unbelievable gibberish.

Edited April 27, 2020 by DjayBee

+DjayBee · April 27, 2020

3 hours ago, Nezgar said:

Update: After a couple hours, I see additional metadata and separate OCR-text only files alongside

archive.org's OCR is much better than Tesseract, but still not usable.

Scan result (unfortunately my ribbon was not very good):

grafik.png.dc9178bf411d3d0372aecafaf5e041ca.png

Original text:

A Hitchhiker's
Guide to the
BIOS
(C) 1985 Atari Corp.
All Rights Reserved

Tesseract:

Poo han bara
Iship cies bo
Mwoi d fAadf su
i t.
All Rights Heserv o “eae

archive.org:

A Hi txhhi ker 
buide to the 
t* i. Ub 
1985 A t a r i C q r p „ 
R i q li t s R e s e r v e d

Edited April 27, 2020 by DjayBee

Justin Payne · April 28, 2020

Dumb question but did you train the fonts for Tesseract?

+DjayBee · April 28, 2020

1 hour ago, Justin Payne said:

Dumb question but did you train the fonts for Tesseract?

Nope, no idea on how to do that.

Justin Payne · April 28, 2020

8 minutes ago, DjayBee said:

Nope, no idea on how to do that.

So, the page for that binary you pointed us to has a link on the right side of the screen titled "Fonts for Tesseract Training": https://github.com/UB-Mannheim/tesseract/wiki/Fonts-for-Tesseract-training and there is also this link on training Tesseract: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html

venom4728a · April 29, 2020

I downloaded a PDF reader called PDF-Xchange. It has an OCR option built into the software, with additional add-on OCR software packages available. The weird thing is that when I just selected the text from the PDF using the "Select Text" tool, I had better results than with the built-in "OCR" tool. Still not good enough to use though.

Too many instances of these getting mixed up:

O & 0

I & 1

8 & B

H & M

? & 7

5 & S
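Some of these mix-ups can at least be caught mechanically. The sketch below assumes every line of a listing starts with a BASIC line number followed by a space, so letters in that position can be forced back to digits, and it flags every other ambiguous character for a human to check. The helper names are hypothetical, and nothing here is specific to any one OCR program.

```python
# Letter-to-digit fixes for the pairs above (H/M has no digit counterpart).
TO_DIGIT = str.maketrans({"O": "0", "I": "1", "B": "8", "?": "7", "S": "5"})

# Every character that takes part in one of the listed mix-ups.
AMBIGUOUS = set("O0I18B?75SHM")


def fix_line_number(line):
    """Force the leading token of a BASIC line to digits where possible."""
    head, sep, rest = line.partition(" ")
    fixed = head.translate(TO_DIGIT)
    if fixed.isdigit():
        return fixed + sep + rest
    return line  # still not a number, so leave the line untouched


def ambiguous_columns(line):
    """Return the 0-based positions of characters that need a manual check."""
    return [i for i, ch in enumerate(line) if ch in AMBIGUOUS]


print(fix_line_number("1O PRINT 5"))  # 10 PRINT 5
print(ambiguous_columns("10 A=B"))    # [0, 1, 5]
```

This only repairs the line number itself. Inside the statement, both characters of each pair can be legitimate, so the second function can only point at the spots worth proofreading.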

Thank you for all the responses! I am going to do it the old-fashioned way. It seems like too much work trying to train the OCR programs for 8 magazines. The Tesseract software sounds cool, but I am not interested in putting the time it appears to take into learning how to use it.

Rybags · April 29, 2020

I seem to remember getting OmniPage a couple of times over a period of 15 years.

Amazing how little progress OCR has made in this time especially considering the amount of power we can now throw at it vs the late 90s.

But I do remember having some success. The problem with old listings using 9-pin impact is that they're just pretty crap to begin with.

There are things that can help: font training if available, and running scans through Photoshop or GIMP can help too. Ideally you want monochrome with as few luma levels as possible. Posterize and similar functions can help there.
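As a rough idea of what posterizing does to a scan, here is a small sketch in plain Python. It assumes a grayscale image stored as a list of rows of 0-255 values and snaps each value to one of a few evenly spaced shades. The function name `posterize` and the sample values are illustrative only, and an image editor's own posterize function may pick its levels differently.

```python
def posterize(pixels, levels=2):
    """Reduce 0-255 grayscale values to `levels` evenly spaced shades.

    With levels=2 the result is pure black and white, split at 128.
    """
    if levels < 2:
        raise ValueError("levels must be at least 2")
    bucket = 256 / levels       # width of each input range
    step = 255 / (levels - 1)   # spacing between output shades
    return [[round(int(v // bucket) * step) for v in row] for row in pixels]


faded_scan = [[10, 100, 130, 250]]
print(posterize(faded_scan, levels=2))  # [[0, 0, 255, 255]]
```

With a worn ribbon the fixed split at 128 may be too dark or too light, so the cut-off point is the first thing worth adjusting.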

Kyle22 · April 30, 2020

Acrobat XI Pro is what I have. Is there something better out there for BASIC and Asm code?

livorno · May 23, 2020

As someone who has made the fun but nevertheless time-consuming mistake of duplicating NuY's work on Computer & Video Games type-ins, I'd suggest you check NuY's SpartaDOS archive of Analog issues 1-8 in this thread:

https://atariage.com/forums/topic/236977-all-the-early-analog-magazine-atrs/page/2/#comments

Asaki · May 23, 2020

I was trying to OCR some D&D PDFs, and had terrible results from every free one I tried, ESPECIALLY the highly recommended ones.

Surprisingly, this one was pretty good: https://www.onlineocr.net/

Idunno if it'll work out for dot-matrix font, but everything I tried came out great. Just had to go through and change a few things...the usual 1/I O/0 mixup, a few spaces where there shouldn't be, and commonly orcs/ores and stats/stars. It gets a little confused over italics sometimes.

I think it's limited to like 50 pages per day, though (if you sign up), but there are...ways around that...¬_¬ Or, you know, if it works out, I think you can pay to remove limitations.
