
Best OCR Program to use.


venom4728a


I am trying to create the first 8 A.N.A.L.O.G. magazine disks.  I am not having good luck with copy and paste from the PDF or with the open-source OCR software I downloaded.  I ended up with so many errors that it would have been easier to just type the lines in manually. Does anyone know of a better way, or a good working OCR program to use?

 

 


I don't know if this helps or just muddies the waters, but I did try to extract BASIC programs from old magazines. Of course, it was that HORRIBLE dot-matrix type, which OCR hates, BUT I heard that you could train the AI in Google's Tesseract. Unfortunately, the programs I found that used it were clunky and didn't produce any better results. Now, it's been a year since I did this, and I think Google has probably improved on it, BUT they're also there to make money, so I think you can use their service, but only for a limited number of calls per day. What does that mean for you if you don't plan to write your own application? It means that if you use a program that somebody else wrote, you'll also have to know what the limitations of the Google service are, since that program has to adhere to them too.
Here is a link to the Google OCR stuff.

https://cloud.google.com/vision/docs/ocr

I just did a real quick search and found this process. Give it a shot.
https://pdf.wondershare.com/pdf-software-comparison/google-docs-convert-scanned-pdf-text.html

Edited by Justin Payne

The open-source programs I've tried need a great deal of training to make them at all useful.  ABBYY FineReader is considered one of the best current commercial readers.  I'm not sure what the limitations of their free trial are, but you may want to take a look there.

 


10 minutes ago, jamm said:

ABBYY FineReader is considered one of the best current commercial readers

I have the 2016 version of this and am currently waiting for it to complete the conversion. It takes a few minutes.

 

Update: Done now

 

Here's the ABBYY FineReader output for comparison:

assembly language programmers guide1-1.docx

Edited by TGB1718
Update

I will try it from Google Drive. Thank you for your response.

 

I downloaded Tesseract from GitHub this morning and, after installing it, I had one application named consol.exe.  I opened it and all I ended up with was a DOS command window. It seemed useless.
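For context, Tesseract is a command-line tool, so a console window is the expected way to use it. Below is a minimal Python sketch of how one run could be scripted, assuming a `tesseract` executable is on the PATH; the file names and the helper names `build_tesseract_command` and `run_ocr` are made up for illustration. The `-l` (language) and `--psm` (page segmentation mode) options are standard Tesseract options, and mode 6 treats the page as a single uniform block of text, which may or may not suit a program listing.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng", psm=6):
    """Return the argument list for a basic Tesseract run.

    Tesseract writes the recognized text to <output_base>.txt.
    """
    return ["tesseract", image_path, output_base, "-l", lang, "--psm", str(psm)]


def run_ocr(image_path, output_base):
    """Run Tesseract on one image and return the path of the text file."""
    # Assumption: the tesseract executable has been installed and is on PATH.
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    return output_base + ".txt"


if __name__ == "__main__":
    # Hypothetical file names; only prints the command that would be run.
    print(" ".join(build_tesseract_command("listing_page.png", "listing_page")))
```

Calling `run_ocr("listing_page.png", "listing_page")` would then produce `listing_page.txt` next to the image, if the scan exists and Tesseract is installed.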

 

I read a few articles about how to improve the accuracy of OCR programs reading dot matrix printer output, similar to how the listings appear in the A.N.A.L.O.G. Magazine.  It involves using photo filters to blur the dots together and using other tools to make the letters appear more solid.  It also seemed like it would take more time to do than simply typing the files in.  If it was just one magazine it might be worth trying.  I have the first magazine nearly done, and am working on the second issue.
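To illustrate the idea of joining the dots, here is a rough Python sketch. It does not use a blur filter; it swaps in morphological closing (grow every ink pixel by one, then shrink by one), which is a common way to get a similar "solid letter" effect. The image is assumed to be a small grid of 0/1 values (1 = ink) purely for demonstration; a real scan would first have to be loaded and binarized with an image library or editor.

```python
def _neighbors(grid, r, c):
    """Yield the in-bounds values of the 3x3 block centered on (r, c)."""
    rows, cols = len(grid), len(grid[0])
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            rr, cc = r + dr, c + dc
            if 0 <= rr < rows and 0 <= cc < cols:
                yield grid[rr][cc]


def dilate(grid):
    """A pixel becomes ink if any pixel in its 3x3 neighborhood is ink."""
    return [[1 if any(_neighbors(grid, r, c)) else 0
             for c in range(len(grid[0]))] for r in range(len(grid))]


def erode(grid):
    """A pixel stays ink only if its whole 3x3 neighborhood is ink."""
    return [[1 if all(_neighbors(grid, r, c)) else 0
             for c in range(len(grid[0]))] for r in range(len(grid))]


def close_gaps(grid):
    """Morphological closing: dilate, then erode, to bridge one-pixel gaps."""
    return erode(dilate(grid))


if __name__ == "__main__":
    # A horizontal stroke printed as three separate dots.
    page = [[0] * 9 for _ in range(5)]
    for col in (2, 4, 6):
        page[2][col] = 1
    for row in close_gaps(page):
        print("".join("#" if v else "." for v in row))
```

On this toy input the three dots come out as one unbroken stroke of the same thickness. Wider gaps would need a larger neighborhood, at the cost of also merging letters that sit close together.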

 

 


I had previously inquired as well, after some challenges with Adobe Acrobat, without getting much useful response either... After spending hours manually correcting 2 club newsletters in Adobe Acrobat, I was dismayed that when I uploaded them to Internet Archive, it discarded my corrected OCR and replaced it with its own... I will not be continuing with that effort for the remaining 48.

 

Anyhow... On the other hand in the name of just getting things archived, maybe it's better to just skip the labour of fixing it and get them online, letting Internet Archive's automatic OCR prevail...


I know Archive.org will automatically OCR images (it looks like they use some version of Abbyy) - but will it actually modify the PDF you put up and discard embedded text?  It looks like it creates new, separate text-only and PDF+OCR files for download.


First of all, I want to thank you, venom4728a, for doing this. I did this for the first nine issues of Antic. It was a fun but frustrating project (but worth it).

 

My older experience with trying to OCR program listings was that it wasn't worth it. When I did sit down and do the early Antic issues, I just:

 

1. Tried to find the individual programs that were on the Net.

2. Whichever ones I couldn't find, I just typed in.

 

Part of the problem with OCRing them is the source scans themselves. Most of the ones on the Net are of lower resolution, which makes it harder for the programs to OCR them.

You would probably be better off re-scanning all the listings at a higher resolution and using those. Of course, you have to have the physical magazines and spend the time doing it.

Maybe the newer OCRing software could do a good job with high-res scans. I really don't know since I gave up on doing that years ago. Hopefully they can now.

 

It would be great if you put them together like I did with the Antic disks and used the earlier Analog menu program that was on the Analog disks.

 

 

If you want some help typing in some of the programs, let me know. (Private message me.) Be glad to help.

 

Allan

 

 

 


2 hours ago, jamm said:

I know Archive.org will automatically OCR images (it looks like they use some version of Abbyy) - but will it actually modify the PDF you put up and discard embedded text?  It looks like it creates new, separate text-only and PDF+OCR files for download.

Archive.org was indeed replacing my embedded text with its own, at least in March 2019. I'm testing again with a test upload right now. We'll see what it does...

 

Update: After a couple hours, I see additional metadata and separate OCR-text only files alongside, but my original uploaded file appears to have remained intact.


A binary version of Tesseract can be found here at Uni Mannheim.

 

I am currently scanning an 8-pin Epson FX printout of "A Hitchhiker's Guide to the BIOS" for the Atari ST, which I will OCR using Tesseract and upload to archive.org later today.

I don't have the time to QA all the OCR results, so I am keeping the original scans as TIFF files instead.

 

But if anyone wants to proofread the result, feel free, and give feedback on Tesseract's quality. ;)

The OCRed result is unbelievable gibberish.

Edited by DjayBee

 

3 hours ago, Nezgar said:

Update: After a couple hours, I see additional metadata and separate OCR-text only files alongside

 

archive.org's OCR is much better than Tesseract's, but still not usable.

 

Scan result (unfortunately my ribbon was not very good):

[image: grafik.png]

Original text:

A Hitchhiker's
Guide to the
BIOS
(C) 1985 Atari Corp.
All Rights Reserved

Tesseract:

Poo han bara
Iship cies bo
Mwoi d fAadf su
i t.
All Rights Heserv o “eae

archive.org:

A Hi txhhi ker 
buide to the 
t* i. Ub 
1985 A t a r i C q r p „ 
R i q li t s R e s e r v e d 

 

Edited by DjayBee

8 minutes ago, DjayBee said:

Nope, no idea on how to do that.

So, the page for that binary you pointed us to has a link on the right side of the screen titled "Fonts for Tesseract Training": https://github.com/UB-Mannheim/tesseract/wiki/Fonts-for-Tesseract-training
There is also this guide to training Tesseract: https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00.html


I downloaded a PDF reader called PDF-Xchange. It has an OCR option built into the software, with additional add-on OCR software packages available. The weird thing is that if I just select the text from the PDF using the "Select Text" tool, I get better results than with the built-in "OCR" tool.  Still not good enough to use, though.

 

Too many instances of these characters getting mixed up:

 

O & 0

I & 1

8 & B

H & M

? & 7

5 & S
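For anyone proofreading OCR output by hand, the pairs above suggest a cheap helper. The Python sketch below is only an illustration with made-up function names: `flag_confusables` marks every character that belongs to one of the listed pairs so the eye can jump straight to it, and `fix_line_number` repairs the one spot where the right answer is unambiguous. It assumes every line of a BASIC listing starts with a numeric line number, so a look-alike letter in that first token can be swapped for its digit.

```python
# Characters taken from the confusion pairs listed above.
CONFUSABLE = set("O0I18BHM?75S")

# Look-alike -> digit mapping, used only inside the leading line number.
TO_DIGIT = str.maketrans("OIBS?", "01857")


def flag_confusables(line):
    """Return a marker string with '^' under each character worth checking."""
    return "".join("^" if ch in CONFUSABLE else " " for ch in line)


def fix_line_number(line):
    """Replace look-alike letters in the leading line number with digits.

    The line is returned unchanged unless the first token becomes a pure
    number after substitution.
    """
    head, sep, rest = line.partition(" ")
    candidate = head.translate(TO_DIGIT)
    if candidate.isdigit():
        return candidate + sep + rest
    return line


if __name__ == "__main__":
    raw = "I0 POKE 710,0"
    fixed = fix_line_number(raw)
    print(fixed)
    print(flag_confusables(fixed))
```

Everything after the line number still needs a human decision, since `O` and `0` are both valid there; the markers only show where to look.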

 

Thank you for all the responses!   I am going to do it the old-fashioned way.  It seems like too much work trying to train the OCR programs for 8 magazines.  The Tesseract software sounds cool, but I am not interested in putting in the time it appears to take to learn how to use it.


I seem to remember getting OmniPage a couple of times over a period of 15 years.

Amazing how little progress OCR has made in this time, especially considering the amount of power we can now throw at it vs the late 90s.

 

But I do remember having some success.  The problem with old listings printed on 9-pin impact printers is that they're just pretty crap to begin with.

 

There are things that can help: font training, if available, but also running the scans through Photoshop or GIMP. Ideally you want monochrome with as few luma levels as possible.  Posterize and similar functions can help there.
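As a rough picture of what a posterize step does to the luma levels, here is a small Python sketch. It assumes the scan is already available as rows of 8-bit grayscale values (0 = black, 255 = white); the function name and sample values are invented for the example, and an image editor's posterize or threshold tool would do the same job on a real scan.

```python
def posterize(pixels, levels=2):
    """Snap 0-255 grayscale values to `levels` evenly spaced luma levels.

    With levels=2 this is a plain black/white threshold at mid-gray.
    """
    if levels < 2:
        raise ValueError("levels must be at least 2")
    step = 255 / (levels - 1)
    return [[round(round(p / step) * step) for p in row] for row in pixels]


if __name__ == "__main__":
    # One scanned row: dark ink, faded ribbon, gray smudge, paper.
    row = [[0, 40, 130, 200, 255]]
    print(posterize(row, levels=2))
```

With two levels, faded ink at 40 is pulled down to solid black and the gray smudge at 130 is pushed up to white, which is the kind of clean monochrome input OCR tends to prefer. The cutoff that works best will depend on the ribbon and the scan.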


  • 4 weeks later...

I was trying to OCR some D&D PDFs, and had terrible results from every free one I tried, ESPECIALLY the highly recommended ones.

 

Surprisingly, this one was pretty good: https://www.onlineocr.net/

 

Idunno if it'll work out for dot-matrix font, but everything I tried came out great. Just had to go through and change a few things...the usual 1/I O/0 mixup, a few spaces where there shouldn't be, and commonly orcs/ores and stats/stars. It gets a little confused over italics sometimes.

 

I think it's limited to like 50 pages per day, though (if you sign up), but there are...ways around that...¬_¬ Or, you know, if it works out, I think you can pay to remove limitations.
