OCR for Code?


19 replies to this topic

#1 matthew180 OFFLINE  

matthew180

    River Patroller

  • 2,544 posts
  • Location:Castaic, California

Posted Thu Jan 4, 2018 12:49 AM

I found some old program printouts, nice 9-pin dot-matrix all-caps 99/4A BASIC-code goodness.  I took some nice high-res photos and tried running them through every OCR program (online and offline) I could find, but they all fail miserably.  All the software seems to try and determine the language and make sense of the code as English (or whatever language it auto-detects when that fails).  Does anyone know of a *dumb* OCR program that will just worry about individual characters, or any OCR designed for technical documentation, code, etc.?



#2 Opry99er OFFLINE  

Opry99er

    Quadrunner

  • 9,898 posts
  • Location:Hustisford, WI

Posted Thu Jan 4, 2018 10:53 AM

I used to work for a guy who ran a data archiving warehouse. Companies would send documents in crates, we would scan them in and then run them through an OCR program. IIRC, there was very little error checking on the OCR, as he would send the companies both the scans and the searchable OCR docs. IIRC, it was called Abby or Addy or something of the like. Pretty sure it was commercial though.

#3 Opry99er OFFLINE  

Opry99er

    Quadrunner

  • 9,898 posts
  • Location:Hustisford, WI

Posted Thu Jan 4, 2018 11:05 AM

I looked it up.. it was Abbyy.

This was in about 2007, so it would have been an older piece of software... they have more modern software available, but it's like $150 to purchase. Yikes...

Might find one of the older programs in public domain though. I cannot remember the actual software name, even after looking at the ones from that era.

#4 Opry99er OFFLINE  

Opry99er

    Quadrunner

  • 9,898 posts
  • Location:Hustisford, WI

Posted Thu Jan 4, 2018 11:07 AM

Additionally, if your primary interest is program recovery, I would be happy to type the programs up for you. I'll have some time this week, since the family is out of town.

#5 mizapf OFFLINE  

mizapf

    River Patroller

  • 3,393 posts
  • Location:Germany

Posted Thu Jan 4, 2018 11:17 AM

Did you already try Tesseract?



#6 matthew180 OFFLINE  

matthew180

    River Patroller

  • Topic Starter
  • 2,544 posts
  • Location:Castaic, California

Posted Thu Jan 4, 2018 1:34 PM

@Opry99er, Abbyy is one of the first programs to come up when you search for OCR, and their online service failed miserably, just like Google, and everyone else.  Right now I only have one program that I found in one of my old COMPUTE! magazines.  Apparently I made some fixes and changes though, since there is some of my chicken-scratch in the magazine.  If you really want to type in the program, here is the link to the edition with the code (starts on pg 44):

 

https://archive.org/...ompute-magazine

 

I really love the Internet Archive! (and yes I donate to it.)  I'm actually considering getting rid of my paper magazines because they have archived everything I have in paper form.  Anyway, I found the print-out tucked in the magazine and thought "gee, I should be able to OCR this in seconds these days, and I can put it up on AtariAge for people who like the old BASIC games..."  Two hours later and I was still trying to find *any* software that could OCR the code and just leave it alone.

 

@mizapf, Tesseract is the only program I have not tried directly yet, mostly because it was not "quick and simple", and a lot of what I read says it is a pain in the ass to use, you have to train it, blah blah blah.  Also, apparently, it is the engine used in things like Google's OCR and such, and that failed for me too.  However, I looked at the command line parameters and you still have to specify a language, so I assume it will fail just as badly as everything else because it will be trying to find English words, which code is not.

 

This was not supposed to be a huge endeavor, and I have already wasted more time trying to get it to work than it would have taken me to just type it again.  Why is that always the way it goes when you are trying to do something you think should be simple by now?

 

I have attached the first page if anyone wants to give it a try.  I cropped the text and reduced the color to 1-bit black and white, otherwise the image is unmodified.  I read that for OCR, characters need to be at least 10-pixels tall, which is easily the case here.  After zooming in, I now see maybe the problem is too much data and separation between the dots from the printer.  Maybe reducing the image to get the dots to merge would help?  I don't know.  Again the goal was quick-and-dirty OCR, which I did not achieve.
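To make the "get the dots to merge" idea concrete, here is a minimal pure-Python sketch of a dilation pass on a 1-bit bitmap: every inked pixel is grown by one pixel in each direction, which closes small gaps between printer dots. The function names and the tiny sample grid are made up for illustration; a real photo would go through an image tool rather than nested lists, and whether this helps a given OCR engine is an assumption, not something tested here.

```python
def dilate(bitmap, radius=1):
    """Return a copy of a 1-bit bitmap in which every inked pixel (1) is
    grown by `radius` pixels in each direction, closing small gaps."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not bitmap[y][x]:
                continue
            # Ink the whole neighbourhood around this dot, clipped to the image.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    out[ny][nx] = 1
    return out


def show(bitmap):
    """Render the bitmap as text: '#' for ink, '.' for paper."""
    return "\n".join("".join("#" if p else "." for p in row) for row in bitmap)


if __name__ == "__main__":
    # Hypothetical horizontal stroke printed as separated dots.
    dotted = [
        [0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0],
    ]
    print(show(dotted))
    print()
    print(show(dilate(dotted)))
```

A larger radius merges more widely spaced dots but also starts gluing neighbouring characters together, so it would need tuning against the actual scan.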

Attached Files



#7 Opry99er OFFLINE  

Opry99er

    Quadrunner

  • 9,898 posts
  • Location:Hustisford, WI

Posted Thu Jan 4, 2018 1:43 PM

I didn't know Abbyy had an online OCR. That's cool, but if it doesn't work, it's kind of useless to us.

Perhaps their commercial software would be more versatile/useful... it is expensive though.

At that document warehouse, it was a large printer, scanner, copier deal. I wasn't responsible for the scanning, I just hauled boxes from the 3-level warehouse off a pick list to the operator. It scanned directly to a computer where it was processed, then the documents were stored for 2 months and then we destroyed boxes of documents by the pallet-load.

#8 ti99iuc ONLINE  

ti99iuc

    Stargunner

  • 1,534 posts
  • Location:Italy

Posted Thu Jan 4, 2018 2:10 PM

I have used this method a couple of times and still needed to fix all the errors afterwards. I still haven't found a perfect one.

Anyway, Acrobat helped me a bit.

 

This is the result from the low-res image you attached, converted with the OCR of Acrobat Reader (it is not free, of course :( )

Attached Files


Edited by ti99iuc, Thu Jan 4, 2018 2:14 PM.


#9 mizapf OFFLINE  

mizapf

    River Patroller

  • 3,393 posts
  • Location:Germany

Posted Thu Jan 4, 2018 2:15 PM

OK, I tried tesseract after applying a blur filter in GIMP to smear the pixels and then saving the image as a grayscale JPG.

 

100 DIM BLOCK$(2),PLACE(2),BUILDING(32 2)
110 RANDOMIZE 5
120 REM BOMB CHARACTER
1 30 CALL CHAR ( 1 29 , " 00 1 CBEFF F FEE 1 COO “ )
140 REM CROSSHAIR CHAR
150 CALL CHAR(130,"181818FFFF181818")
160 CALL CLEAR
170 CALL SCREEN(12)
180 FOR J=5 TO 8
190 CALL CDLDRiJ,5,16)
200 NEXT J
210 FOR J=9 TD 12
220 CALL CULUR(J,2,14)
230 NEXT J
240 T=G
250 P=0
250 9:0
270 "=0
280 CALL CLEAR
290 PRINT " AIR DEFENSE"
300 PRINT
310 PRINT
320 PRINT
330 PRINT " do you need instructions ?"
340 PRINT
350 PRINT " type Y or N"
360 FOR 1:1 TD 7
370 PRINT
380 NEXT I
390 CALL KEY(3,Y,STATUS)
400 IF STATUS=0 THEN 390
410 IF Y=ASC("N“)THEN ?50
420 IF Y=ASC£"Y")THEN 520
430 CALL CLEAR
440 PRINT
450 PRINT “ you did not press Y OF N."
450 FOR I=1 TB 13
470 PRINT
480 NEXT I
490 FOR DELAY=1 TO 500


#10 mizapf OFFLINE  

mizapf

    River Patroller

  • 3,393 posts
  • Location:Germany

Posted Thu Jan 4, 2018 2:20 PM

BTW, I mentioned tesseract because I was highly surprised by its good recognition performance. I OCRed the HFDC manual that you can find on Whtech, and I also did that to the Editor/Assembler manual with very few errors (fewer than 10 on a whole page).

 

Apart from the extra work with GIMP (which could be done automatically by ImageMagick on the command line), I just ran

 

$ tesseract post.jpg listing

 

One thing: There are lots of empty lines which are not shown in the listing above.
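Those empty lines are easy to filter out afterwards. Here is a minimal Python sketch, with made-up function names that are not part of tesseract, that drops blank lines and also flags BASIC line numbers that do not increase, which tends to point at misread digits. The blank lines in the sample text are assumed, since they are not shown in the listing above.

```python
import re


def drop_blank_lines(text):
    """Return the non-empty lines of the OCR text, trailing whitespace removed."""
    return [line.rstrip() for line in text.splitlines() if line.strip()]


def suspicious_line_numbers(lines):
    """Return lines whose leading number is missing or is not greater than
    the last accepted line number (a hint that a digit was misread)."""
    flagged = []
    previous = -1
    for line in lines:
        match = re.match(r"\s*(\d+)\b", line)
        if match is None:
            flagged.append(line)
            continue
        number = int(match.group(1))
        if number <= previous:
            flagged.append(line)
        else:
            previous = number
    return flagged


if __name__ == "__main__":
    # A few lines from the listing above, with assumed empty lines in between.
    ocr_text = '240 T=G\n\n250 P=0\n\n\n250 9:0\n270 "=0\n'
    lines = drop_blank_lines(ocr_text)
    print("\n".join(lines))
    print("check by hand:", suspicious_line_numbers(lines))
```

The check only says where to look; it does not try to guess what the right line number was.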



#11 mizapf OFFLINE  

mizapf

    River Patroller

  • 3,393 posts
  • Location:Germany

Posted Thu Jan 4, 2018 2:30 PM

OK, another try with ImageMagick. I had to experiment with the blur setting (the radius x sigma value passed to -blur).

 

$ convert -blur 11x2 post0.png post1.jpg
$ tesseract post1.jpg listing

 

yields this result:

 

100 DIM BLOCK$(2),PLACE(2),BUILDING(32 2)
110 RANDOMIZE 5
120 REM BOMB CHARACTER
130 CALL CHAR(129,"OOICBEFFFFBE1C00")
140 REM CROSSHAIR CHAR
150 CALL CHAR(130,"181818FFFF181818")
160 CALL CLEAR
170 CALL SCREEN(12)
180 FOR J=q TO 8
190 CALL COLDR(J,5,16)
200 NEXT J
210 FOR J=9 TO 12
220 CALL COLOR(J,2,14)
230 NEXT J
240 T=O
250 P=O
260 Q=O
270 M=O
280 CALL CLEAR
290 PRINT " AIR DEFENSE“
300 PRINT
310 PRINT
320 PRINT
330 PRINT " do you need instructions ?"
340 PRINT
350 PRINT " type Y or N"
360 FOR I=1 TD 7
370 PRINT
380 NEXT I
190 CALL KEY(3,Y,STATUS)
400 IF STATUS=O THEN 390
410 IF Y=ASC("N")THEN 750
420 IF Y=ASC("Y")THEN 520
430 CALL CLEAR
440 PRINT
450 PRINT " you did not press Y or N."
460 FOR I=1 TD 13
470 PRINT
480 NEXT I
490 FOR DELAY=1 TO 500

 

Still some trouble with zero versus letter O, and O versus D.
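Some of these confusions can be patched up after the fact with a few context rules. The following Python sketch is only a heuristic under stated assumptions: the keyword table holds just the misreadings visible in the listing above, a lone letter O outside a string is assumed to be a zero (which would be wrong for a program that really uses a variable named O), and the pattern string of CALL CHAR is assumed to contain only hex digits, so O and I there are read as 0 and 1. All names are made up for illustration.

```python
import re

# Misreadings seen in the listing above; extend as needed (assumption).
KEYWORD_FIXES = {"TD": "TO", "COLDR": "COLOR"}
# A hex pattern cannot contain the letters O or I.
HEX_FIXES = str.maketrans({"O": "0", "I": "1"})


def fix_code_part(code):
    """Repair a stretch of BASIC that lies outside string literals."""
    def fix_word(match):
        word = match.group(0)
        if word == "O":  # a lone letter O is assumed to be a zero
            return "0"
        return KEYWORD_FIXES.get(word, word)
    return re.sub(r"[A-Za-z][A-Za-z0-9]*", fix_word, code)


def fix_line(line):
    """Apply the heuristics to one OCRed line, leaving ordinary strings alone."""
    # Normalize curly quotes that the OCR produced for plain double quotes.
    line = line.replace("\u201c", '"').replace("\u201d", '"')
    is_char_call = re.search(r"\bCALL\s+CHAR\b", line) is not None
    parts = line.split('"')
    for i, part in enumerate(parts):
        if i % 2 == 0:          # outside a string literal
            parts[i] = fix_code_part(part)
        elif is_char_call:      # hex pattern inside CALL CHAR
            parts[i] = part.translate(HEX_FIXES)
    return '"'.join(parts)


if __name__ == "__main__":
    for sample in ['130 CALL CHAR(129,"OOICBEFFFFBE1C00")',
                   '190 CALL COLDR(J,5,16)',
                   '240 T=O',
                   '360 FOR I=1 TD 7']:
        print(fix_line(sample))
```

A line such as `180 FOR J=q TO 8` still needs a human, because nothing in the line says which digit the `q` was.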



#12 matthew180 OFFLINE  

matthew180

    River Patroller

  • Topic Starter
  • 2,544 posts
  • Location:Castaic, California

Posted Thu Jan 4, 2018 3:31 PM

Nice.  That is much better than anything I have tried so far.  Thanks for the command line parameters, I might mess around with this a little more tonight.



#13 mizapf OFFLINE  

mizapf

    River Patroller

  • 3,393 posts
  • Location:Germany

Posted Thu Jan 4, 2018 3:48 PM

The trick is indeed to blur the picture suitably, and maybe to apply some contrast afterwards. As I said, there are still some empty lines, but this is not too difficult to cope with.

 

The language given to tesseract also determines the valid characters. In particular, when I want to make it recognize German, it has to check for umlaut characters (ä, ö, ü) which could be seen as a, o, u with some speckles in English. The same goes for French, Spanish etc. So it is not necessarily a question of vocabulary.
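If the language mostly decides which characters are considered, another knob is to restrict the character set directly. Tesseract has a `tessedit_char_whitelist` config variable that can be set with `-c`. Whether it helps on this printout is untested here, and reportedly some Tesseract versions ignore it with the newer LSTM engine. The Python sketch below only builds the command line; the helper name and the chosen character set are assumptions, and the space character is left out on the assumption that Tesseract handles word spacing on its own.

```python
import shlex
import string

# Characters assumed to be enough for this listing; adjust for the printout.
BASIC_CHARS = (string.ascii_uppercase + string.ascii_lowercase
               + string.digits + '"$(),.:;=+-*/<>?&^')


def build_tesseract_command(image_path, output_base,
                            whitelist=BASIC_CHARS, language="eng"):
    """Return the argument list for a tesseract run limited to `whitelist`."""
    return [
        "tesseract", image_path, output_base,
        "-l", language,
        "-c", "tessedit_char_whitelist=" + whitelist,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_tesseract_command("post1.jpg", "listing")))
```

Passing the returned list to `subprocess.run(..., check=True)` would run it without any shell quoting problems for the quote and dollar characters in the whitelist.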



#14 Schmitzi OFFLINE  

Schmitzi

    River Patroller

  • 4,391 posts
  • ToXiC
  • Location:Germany

Posted Thu Jan 4, 2018 3:54 PM

Paperport 14.1 (no special settings, "many" errors)

 

Spoiler


#15 Schmitzi OFFLINE  

Schmitzi

    River Patroller

  • 4,391 posts
  • ToXiC
  • Location:Germany

Posted Thu Jan 4, 2018 6:09 PM

AIRDEFENSE from the big PDF:

 

So here we go.

I had to change the variable 'DIGIT' to 'DIGIT2' in lines 2530 + 2540.

Don't ask me why, and don't ask me about any further impacts :)

 

Also, the space bar does not fire, and my BASIC is very rusty. Maybe somebody has an idea...?

If a fix is found, I can make a DSK or TIFILE from it.

 

Let's have fun :)

 

Textfile: Attached File  AirDefense-TI994A-TI-Basic.txt   5.95KB   1 downloads

 

Code:

Spoiler

 

Attached File  AirDefense-Screen.JPG   42.77KB   0 downloads



#16 Opry99er OFFLINE  

Opry99er

    Quadrunner

  • 9,898 posts
  • Location:Hustisford, WI

Posted Thu Jan 4, 2018 6:20 PM

Sweet!!!

#17 matthew180 OFFLINE  

matthew180

    River Patroller

  • Topic Starter
  • 2,544 posts
  • Location:Castaic, California

Posted Thu Jan 4, 2018 11:49 PM

Nice job!  Did you type that in or OCR the PDF?



#18 ralphb OFFLINE  

ralphb

    Dragonstomper

  • 574 posts
  • Location:Germany

Posted Fri Jan 5, 2018 6:36 AM

About ten years ago, I signed up for the OmniScan SDK, which even required signing an NDA.  My idea was to create a BASIC "language" that would run after the raw recognition to remove the most obvious errors.

 

For each letter on the page, OmniScan creates a list of potential glyphs, together with their probabilities.  In theory, you could select those glyphs with the highest probability that made sense according to a BASIC grammar.  To distinguish variables, say A0$ from AO$, you'd need even more heuristics.
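The selection step can be sketched in a few lines of Python. This is not the OmniScan API, which is not described here; the candidate format (one list of glyph/probability pairs per character position), the probabilities, the function name and the tiny keyword set are all assumptions for illustration. The sketch tries every combination of candidates, keeps only those that spell a known keyword, and returns the most probable one.

```python
from itertools import product
from math import prod

# Partial, illustrative keyword list (assumption), not a full BASIC grammar.
KEYWORDS = {"CALL", "COLOR", "FOR", "TO", "NEXT", "PRINT", "IF", "THEN"}


def best_keyword(candidates, vocabulary=KEYWORDS):
    """candidates: one list of (glyph, probability) pairs per position.
    Return (word, score) for the most probable combination that is in
    `vocabulary`, or None if no combination matches."""
    best = None
    for combo in product(*candidates):
        word = "".join(glyph for glyph, _ in combo)
        if word not in vocabulary:
            continue
        score = prod(p for _, p in combo)
        if best is None or score > best[1]:
            best = (word, score)
    return best


if __name__ == "__main__":
    # Hypothetical recognizer output for a word whose raw top choice is "COLDR".
    candidates = [
        [("C", 0.90), ("G", 0.10)],
        [("O", 0.60), ("0", 0.40)],
        [("L", 0.95), ("I", 0.05)],
        [("D", 0.55), ("O", 0.45)],
        [("R", 0.90), ("P", 0.10)],
    ]
    print(best_keyword(candidates))
```

Trying every combination is only practical for short tokens, and it does nothing for variable names such as A0$ versus AO$, which is where the extra heuristics would come in.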

 

All in all, this seemed too time-consuming to me, so it never went anywhere.



#19 Schmitzi OFFLINE  

Schmitzi

    River Patroller

  • 4,391 posts
  • ToXiC
  • Location:Germany

Posted Fri Jan 5, 2018 10:43 AM

I made screenshot snips from the PDF to get the right sequence.

Then I made a new, searchable PDF with OmniPage, and copied the text into a TXT file.

Then I had to manually check and correct it against the original PDF. It was some work :)

 

Please do not forget my comment that fire does not work (or do I just not know how to play it?)

And the mystery with the DIGIT variable: I would like to know the reason.



#20 OLD CS1 OFFLINE  

OLD CS1

    Technomancer

  • 5,683 posts
  • Technology Samurai
  • Location:Tallahassee, FL

Posted Fri Jan 5, 2018 6:51 PM

What engine does Neat Receipts use?  Supposedly that system is very accurate with receipts, which are far worse than magazine listings.





