
OCR for Code?


matthew180


I found some old program printouts, nice 9-pin dot-matrix all-caps 99/4A BASIC-code goodness. I took some nice high-res photos and tried running them through every OCR program (online and offline) I could find, but they all fail miserably. All the software seems to try and determine the language and make sense of the code as English (or whatever language it auto-detects when that fails). Does anyone know of a *dumb* OCR program that will just worry about individual characters, or any OCR designed for technical documentation, code, etc.?


I used to work for a guy who ran a data archiving warehouse. Companies would send documents in crates, we would scan them in and then run them through an OCR program. IIRC, there was very little error checking on the OCR, as he would send the companies both the scans and the searchable OCR docs. IIRC, it was called Abby or Addy or something of the like. Pretty sure it was commercial though.


I looked it up... it was Abbyy.

This was in about 2007, so it would have been an older piece of software... they have more modern software available, but it's like $150 to purchase. Yikes...

You might find one of the older programs in the public domain, though. I cannot remember the actual software name, even after looking at the ones from that era.


@Opry99er, Abbyy is one of the first programs to come up when you search for OCR, and their online service failed miserably, just like Google and everyone else. Right now I only have one program, which I found in one of my old COMPUTE! magazines. Apparently I made some fixes and changes, though, since some of my chicken-scratch is in the magazine. If you really want to type in the program, here is the link to the edition with the code (it starts on pg 44):

https://archive.org/details/1983-04-compute-magazine

I really love the Internet Archive! (And yes, I donate to it.) I'm actually considering getting rid of my paper magazines because they have archived everything I have in paper form. Anyway, I found the printout tucked in the magazine and thought, "gee, I should be able to OCR this in seconds these days, and I can put it up on AtariAge for people who like the old BASIC games..." Two hours later, I was still trying to find *any* software that could OCR the code and just leave it alone.

@mizapf, Tesseract is the only program I have not tried directly yet, mostly because it was not "quick and simple", and a lot of what I read says it is a pain in the ass to use, you have to train it, blah blah blah. Also, apparently, it is the engine used in things like Google's OCR, and that failed for me too. However, I looked at the command-line parameters and you still have to specify a language, so I assume it will fail just as badly as everything else because it will be trying to find English words, which code is not.

This was not supposed to be a huge endeavor, and I have already wasted more time trying to get it to work than it would have taken me to just type it again. Why is that always the way it goes when you are trying to do something you think should be simple by now?

I have attached the first page if anyone wants to give it a try. I cropped the text and reduced the color to 1-bit black and white; otherwise the image is unmodified. I read that for OCR, characters need to be at least 10 pixels tall, which is easily the case here. After zooming in, I now see that the problem may be too much detail and too much separation between the printer's dots. Maybe reducing the image to get the dots to merge would help? I don't know. Again, the goal was quick-and-dirty OCR, which I did not achieve.

post-24952-0-74083900-1515094244_thumb.png


I didn't know Abbyy had an online OCR. That's cool, but if it doesn't work, it's kind of useless to us.

Perhaps their commercial software would be more versatile/useful... it is expensive though.

At that document warehouse, it was a large printer/scanner/copier deal. I wasn't responsible for the scanning; I just hauled boxes from the three-level warehouse off a pick list to the operator. It scanned directly to a computer where the scans were processed, then the documents were stored for 2 months, and then we destroyed boxes of documents by the pallet-load.


I have used this method a couple of times, and I still had to fix all the errors afterwards. I still haven't found a perfect one.

Anyway, Acrobat helped me a bit.

This is the result of converting the low-res image you attached with the OCR in Acrobat Reader (it is not free, of course :( )

post-24952-0-74083900-1515094244_thumb.pdf

converted in txt.txt

Edited by ti99iuc

OK, I tried tesseract after applying a blur filter in GIMP to smear the pixels and then saving the result as a grayscale JPG.

100 DIM BLOCK$(2),PLACE(2),BUILDING(32 2)
110 RANDOMIZE 5
120 REM BOMB CHARACTER
1 30 CALL CHAR ( 1 29 , " 00 1 CBEFF F FEE 1 COO “ )
140 REM CROSSHAIR CHAR
150 CALL CHAR(130,"181818FFFF181818")
160 CALL CLEAR
170 CALL SCREEN(12)
180 FOR J=5 TO 8
190 CALL CDLDRiJ,5,16)
200 NEXT J
210 FOR J=9 TD 12
220 CALL CULUR(J,2,14)
230 NEXT J
240 T=G
250 P=0
250 9:0
270 "=0
280 CALL CLEAR
290 PRINT " AIR DEFENSE"
300 PRINT
310 PRINT
320 PRINT
330 PRINT " do you need instructions ?"
340 PRINT
350 PRINT " type Y or N"
360 FOR 1:1 TD 7
370 PRINT
380 NEXT I
390 CALL KEY(3,Y,STATUS)
400 IF STATUS=0 THEN 390
410 IF Y=ASC("N“)THEN ?50
420 IF Y=ASC£"Y")THEN 520
430 CALL CLEAR
440 PRINT
450 PRINT “ you did not press Y OF N."
450 FOR I=1 TB 13
470 PRINT
480 NEXT I
490 FOR DELAY=1 TO 500

BTW, I mentioned tesseract because I was highly surprised by its good recognition performance. I OCRed the HFDC manual that you can find on Whtech, and I also did the same with the Editor/Assembler manual, with very few errors (fewer than 10 on a whole page).

Apart from the extra work with GIMP (which could be done automatically by ImageMagick on the command line), I just ran

$ tesseract post.jpg listing

One thing: there are lots of empty lines, which are not shown in the listing above.
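
The empty lines can be stripped with a few lines of Python. This is only a minimal sketch: it assumes the command above wrote its text to `listing.txt`, and the helper name `drop_blank_lines` is made up for illustration.

```python
def drop_blank_lines(text: str) -> str:
    """Return text without lines that are empty or contain only whitespace."""
    kept = [line.rstrip() for line in text.splitlines() if line.strip()]
    return "\n".join(kept) + ("\n" if kept else "")


# Small in-memory sample standing in for the contents of listing.txt.
sample = "100 DIM BLOCK$(2)\n\n110 RANDOMIZE\n   \n120 REM BOMB CHARACTER\n"
print(drop_blank_lines(sample), end="")
```

To clean the real file, read it with `open("listing.txt").read()`, pass the text through the function, and write the result back out.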


OK, another try with ImageMagick. I had to experiment with the blur vector.

$ convert -blur 11x2 post0.png post1.jpg
$ tesseract post1.jpg listing

yields this result:

100 DIM BLOCK$(2),PLACE(2),BUILDING(32 2)
110 RANDOMIZE 5
120 REM BOMB CHARACTER
130 CALL CHAR(129,"OOICBEFFFFBE1C00")
140 REM CROSSHAIR CHAR
150 CALL CHAR(130,"181818FFFF181818")
160 CALL CLEAR
170 CALL SCREEN(12)
180 FOR J=q TO 8
190 CALL COLDR(J,5,16)
200 NEXT J
210 FOR J=9 TO 12
220 CALL COLOR(J,2,14)
230 NEXT J
240 T=O
250 P=O
260 Q=O
270 M=O
280 CALL CLEAR
290 PRINT " AIR DEFENSE“
300 PRINT
310 PRINT
320 PRINT
330 PRINT " do you need instructions ?"
340 PRINT
350 PRINT " type Y or N"
360 FOR I=1 TD 7
370 PRINT
380 NEXT I
190 CALL KEY(3,Y,STATUS)
400 IF STATUS=O THEN 390
410 IF Y=ASC("N")THEN 750
420 IF Y=ASC("Y")THEN 520
430 CALL CLEAR
440 PRINT
450 PRINT " you did not press Y or N."
460 FOR I=1 TD 13
470 PRINT
480 NEXT I
490 FOR DELAY=1 TO 500

 

Still some trouble with zero versus letter O, and O versus D.
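
Some of these confusions could be cleaned up after the fact. The following is a rough sketch, not something anyone in this thread actually ran: it assumes a small hand-picked keyword list, assumes the program has no variable literally named `O`, and assumes a 16-character `CALL CHAR` pattern should be pure hex. All names in it are invented for illustration.

```python
import re
from itertools import product

# Hand-picked subset of TI BASIC words seen in this listing (an assumption).
KEYWORDS = {
    "CALL", "CHAR", "CLEAR", "COLOR", "DATA", "DIM", "END", "FOR", "GOSUB",
    "GOTO", "HCHAR", "IF", "KEY", "NEXT", "PRINT", "RANDOMIZE", "READ", "REM",
    "RETURN", "SCREEN", "SOUND", "STEP", "THEN", "TO", "VCHAR",
}
# Glyphs the OCR tends to mix up: each key may really be any of its values.
SWAPS = {"D": "DO", "0": "0O", "1": "1I"}
HEX_LOOKALIKES = str.maketrans({"O": "0", "I": "1"})


def repair_word(word):
    """Turn a token into a keyword if swapping look-alike glyphs yields one."""
    if word == "O":  # assumption: a lone O is a misread zero
        return "0"
    if word in KEYWORDS or len(word) > 10:
        return word
    options = [SWAPS.get(ch, ch) for ch in word]
    for candidate in product(*options):
        text = "".join(candidate)
        if text in KEYWORDS:
            return text
    return word


def repair_line(line):
    """Repair code outside string literals, and hex patterns in CALL CHAR."""
    parts = line.split('"')  # odd-numbered parts are inside quotes
    is_char_call = "CALL CHAR" in line
    for i, part in enumerate(parts):
        if i % 2 == 0:
            parts[i] = re.sub(r"[A-Za-z0-9]+",
                              lambda m: repair_word(m.group()), part)
        elif is_char_call and re.fullmatch(r"[0-9A-FOI]{16}", part):
            parts[i] = part.translate(HEX_LOOKALIKES)
    return '"'.join(parts)


for raw in ['130 CALL CHAR(129,"OOICBEFFFFBE1C00")', "190 CALL COLDR(J,5,16)",
            "240 T=O", "360 FOR I=1 TD 7"]:
    print(repair_line(raw))
```

A pass like this cannot fix misread line numbers or variable names, so the output still needs a read-through against the printout.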


The trick is indeed to blur the picture suitably, and maybe to apply some contrast afterwards. As I said, there are still some empty lines, but this is not too difficult to cope with.
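
To illustrate why blurring helps with dot-matrix output, here is a toy sketch in plain Python. It uses a simple box blur followed by a threshold on a tiny 0/1 raster, which is a stand-in for, not a reimplementation of, the GIMP or ImageMagick filters; the function names and the threshold value are arbitrary choices for the demo.

```python
def box_blur(img, radius):
    """Average each pixel with its neighbours within `radius` (edges clipped)."""
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def threshold(img, level):
    """Turn the blurred image back into 1-bit: 1 where value >= level."""
    return [[1 if value >= level else 0 for value in row] for row in img]


def count_runs(row):
    """Count separate horizontal runs of ink (1s) in one row."""
    runs, inside = 0, False
    for value in row:
        if value and not inside:
            runs += 1
        inside = bool(value)
    return runs


# Three isolated printer dots on the middle row, one blank pixel between them.
dots = [[0] * 9, [0, 0, 1, 0, 1, 0, 1, 0, 0], [0] * 9]
merged = threshold(box_blur(dots, 1), 0.1)
print(count_runs(dots[1]), "runs before,", count_runs(merged[1]), "after")
```

The separate dots fuse into one stroke, which is closer to the solid glyphs an OCR engine expects. The price is that strokes get thicker, so too much blur will close up the gaps that distinguish O from D or 0.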

 

The language given to tesseract also determines the valid characters. In particular, when I want to make it recognize German, it has to check for umlaut characters (ä, ö, ü) which could be seen as a, o, u with some speckles in English. The same goes for French, Spanish etc. So it is not necessarily a question of vocabulary.
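
Besides the language, tesseract has a configuration variable, `tessedit_char_whitelist`, that restricts the characters it may output. The sketch below only builds such a command line; nobody in this thread tried it, the character set is a guess at what a TI BASIC listing needs, and some tesseract versions reportedly ignore the variable with the default recognition engine, so treat it as something to test rather than a known fix.

```python
import shlex
import string

# Guess at the characters a TI BASIC listing can contain (an assumption).
BASIC_CHARS = (string.ascii_uppercase + string.ascii_lowercase
               + string.digits + '$(),"=+-*/<>:;.?!&#')


def build_tesseract_command(image, output_base, whitelist=BASIC_CHARS):
    """Return the argument list for a tesseract run limited to `whitelist`."""
    return ["tesseract", image, output_base,
            "-c", f"tessedit_char_whitelist={whitelist}"]


command = build_tesseract_command("post1.jpg", "listing")
print(shlex.join(command))  # shell-quoted form, for copy and paste
```

Passing the list to `subprocess.run(command, check=True)` would run it without any shell quoting problems around the `"` and `$` characters.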


Paperport 14.1 (no special settings, "many" errors)

 

 

 

100 DIM BLOCK$(2),PLACE(2),BUILD1NG(32,2)
110 RANDOMIZE
120 REM BOMB CHARACTER
130 CALL CHAR(129,"001CBEFFFFBEIC00")
140 REM CROSSHAIR CHAR
150 CALL CHAR(130,"181818FFFF181818")
160 CALL CLEAR
170 CALL SCREEN (I2)
180 FOR 3=5 TO 8
190 CALL COLOR(J,506)
200 NEXT J
210 FOR J=9 TO 12
220 CALL COLOR(3,2,14)
230 NEXT
240 T=0
250 P=0
260 Q=0
270 M=0
280 CALL CLEAR
290 PRINT AIR DEFENSE"
300 PRINT
310 PRINT
32 PRINT
3W0 PRINT " do you need instructions ?
"
340 PRINT
350 PRINT type or N"
360 FOR :1=1 TO 7
370 PRINT
380 NEXT
390 CALL KEY(3,Y,STATUS)
400 IF STATUS=0 THEN 390
410 IF Y=ASC("N")THEN 750
420 IF Y=ASC("Y")THEN 520
430 CALL CLEAR
440 PRINT
450 PRINT you did not press Y or Nm"
460 FOR 1=1 TO 13
470 PRINT
480 NEXT
490 FOR DELAY=I TO 500

 

 


AIRDEFENSE from the big PDF:

So here we go.

I had to change the variable 'DIGIT' to 'DIGIT2' in lines 2530 and 2540.

Don't ask me why, and don't ask me about any further effects :)

Also, the space bar does not fire, and my BASIC is very rusty. Maybe somebody has an idea...?

If a fix is found, I can make a DSK or TIFILE from it.

Let's have fun :)

Textfile: AirDefense-TI994A-TI-Basic.txt

Code:

90 REM AIR DEFENSE BY T.L.WAHL
91 REM TI-BASIC OR EXTENDED BASIC

100 DIM BLOCK$(2),PLACE(2),BUILDING(32,2)
110 RANDOMIZE
120 REM BOMB CHARACTER
130 CALL CHAR(129,"001CBEFFFFBE1C00")
140 REM CROSSHAIR CHARACTER
150 CALL CHAR(130,"181818FFFF181818")
160 CALL CLEAR
170 CALL SCREEN(12)
180 FOR J=5 TO 8
190 CALL COLOR(J,5,16)
200 NEXT J
210 FOR J=9 TO 12
220 CALL COLOR(J,2,14)
230 NEXT J
240 T=0
250 P=0
260 Q=0
270 M=0
280 CALL CLEAR
290 PRINT " AIR DEFENSE"
300 PRINT
310 PRINT " BY T.L. WAHL"
320 PRINT
321 PRINT
322 PRINT

330 PRINT " do you need instructions ?"
340 PRINT
350 PRINT " type Y or N"
360 FOR I=1 TO 7
370 PRINT
380 NEXT I
390 CALL KEY(3,Y,STATUS)
400 IF STATUS=0 THEN 390
410 IF Y=ASC("N") THEN 750
420 IF Y=ASC("Y") THEN 520
430 CALL CLEAR
440 PRINT
450 PRINT " you did not press Y or N."
460 FOR I=1 TO 13
470 PRINT
480 NEXT I
490 FOR DELAY=1 TO 500
500 NEXT DELAY
510 GOTO 280
520 CALL CLEAR
530 PRINT " YOU MUST STOP THE FALLING"
540 PRINT " BOMB BY EXPLODING IT IN"
550 PRINT " MID-AIR"
560 PRINT
570 PRINT " -MOVE THE CROSSHAIR-"
580 PRINT
590 PRINT " left :HOLD THE s KEY"
600 PRINT " right:HOLD THE d KEY"
610 PRINT " up :HOLD THE e KEY"
620 PRINT " down :HOLD THE x KEY"
630 PRINT
640 PRINT "WHEN THE BOMB AND THE"
650 PRINT "CROSSHAIR ARE LINED UP,"
660 PRINT "FIRE BY PRESSING THE SPACE"
670 PRINT "BAR. THE SOONER YOU GET THE"
680 PRINT "BOMB THE HIGHER YOUR SCORE."
690 PRINT
700 PRINT
710 PRINT
720 PRINT " PRESS any key TO BEGIN"
730 CALL KEY(0,S,STATUS)
740 IF STATUS=0 THEN 730
750 CALL CLEAR
760 CALL COLOR(8,2,1)
770 PRINT " GOOD LUCK!!!"
780 FOR I=1 TO 10
790 PRINT
800 NEXT I
810 IF R=ASC("R") THEN 840
820 GOSUB 2090
830 GOTO 860
840 FOR I=1 TO 250
850 NEXT I
860 CALL CLEAR
870 GOSUB 2300
880 IF T=20 THEN 1860
890 T=T+1
900 CCROSS=16
910 RCROSS=21
920 RBOMB=1
930 CALL SCREEN(6)
940 CBOMB=INT(RND*29)+2
950 H$=STR$(T)
960 ROW=2
970 COL=3
980 GOSUB 2520
990 SCORE=P*Q*10
1000 H$=STR$(SCORE)
1010 ROW=5
1020 GOSUB 2520
1030 FOR I=1 TO 70
1040 NEXT I
1050 FOR I=2 TO 5 STEP 3
1060 CALL HCHAR(I,3,32,6)
1070 NEXT I
1080 OLDRCROSS=RCROSS
1090 OLDCCROSS=CCROSS
1100 CALL KEY(0,A,STATUS)
1110 IF A<>ASC("E") THEN 1130
1120 RCROSS=RCROSS-SGN(RCROSS-1)
1130 IF A<>ASC("X") THEN 1150
1140 RCROSS=RCROSS+SGN(22-RCROSS)
1150 IF A<>ASC("D") THEN 1170
1160 CCROSS=CCROSS+SGN(31-CCROSS)
1170 IF A<>ASC("S") THEN 1190
1180 CCROSS=CCROSS-SGN(CCROSS-2)
1190 IF RBOMB=1 THEN 1210
1200 CALL VCHAR(RBOMB-1,CBOMB,32)
1210 IF (RCROSS=OLDRCROSS)*(CCROSS=OLDCCROSS) THEN 1230
1220 CALL VCHAR(OLDRCROSS,OLDCCROSS,32)
1230 CALL VCHAR(RCROSS,CCROSS,130)
1240 CALL VCHAR(RBOMB,CBOMB,129)
1250 RBOMB=RBOMB+1
1260 IF RBOMB=23 THEN 1540
1270 IF (RCROSS=RBOMB-1)*(CCROSS=CBOMB) THEN 1290
1280 GOTO 1080
1290 CALL KEY(0,B,STATUS)
1300 IF B=32 THEN 1330
1310 GOTO 1080
1320 REM BOMB DESTROYED
1330 RBOMB=RBOMB-1
1340 CALL SCREEN(10)
1350 CALL VCHAR(RBOMB,CBOMB,32)
1360 CNT=0
1370 C1=92
1380 C2=47
1390 FOR I=-1 TO 1 STEP 2
1400 CALL VCHAR(RBOMB+I,CBOMB+I,C1)
1410 CALL VCHAR(RBOMB+I,CBOMB-I,C2)
1420 NEXT I
1430 C1=32
1440 C2=32
1450 IF CNT=1 THEN 1510
1460 CNT=1
1470 FOR VOL=10 TO 30 STEP 5
1480 CALL SOUND(100,-6,VOL)
1490 NEXT VOL
1500 GOTO 1390
1510 P=P+1
1520 Q=Q+(23-RBOMB)
1530 GOTO 880
1540 REM BOMB HITS THE CITY
1550 CALL VCHAR(22,CBOMB,32)
1560 CALL SCREEN(9)
1570 CALL COLOR(12,11,1)
1580 CALL VCHAR(23,CBOMB-1,122)
1590 CALL VCHAR(23,CBOMB,32)
1600 CALL VCHAR(23,CBOMB+1,123)
1610 CALL VCHAR(24,CBOMB-1,124)
1620 CALL VCHAR(24,CBOMB,125)
1630 CALL VCHAR(24,CBOMB+1,126)
1640 FOR I=1 TO 20
1650 NEXT I
1660 CALL COLOR(12,7,1)
1670 CALL SCREEN(12)
1680 FOR I=1 TO 20
1690 NEXT I
1700 CALL SCREEN(7)
1710 FOR VOL=24 TO 1 STEP -4
1720 CALL SOUND(200,-7,VOL)
1730 NEXT VOL
1740 FOR DVOL=1 TO 24 STEP 4
1750 CALL SOUND(200,-7,DVOL)
1760 NEXT DVOL
1770 FOR J=23 TO 24
1780 FOR I=CBOMB-1 TO CBOMB+1
1790 CALL VCHAR(J,I,32)
1800 NEXT I
1810 NEXT J
1820 CALL VCHAR(RCROSS,CCROSS,32)
1830 CALL COLOR(12,2,14)
1840 M=M+1
1850 GOTO 880
1860 CALL CLEAR
1870 CALL SCREEN(4)
1880 CALL COLOR(8,5,16)
1890 PRINT " GAME OVER"
1900 FOR I=1 TO 4
1910 PRINT
1920 NEXT I
1930 PRINT " DESTROYED ";P
1940 PRINT
1950 PRINT " MISSED ";M
1960 PRINT
1970 PRINT " TOTAL POINTS";P*Q*10
1980 FOR I=1 TO 4
1990 PRINT
2000 NEXT I
2010 PRINT " PRESS r TO PLAY AGAIN"
2020 PRINT
2030 PRINT
2040 CALL KEY(0,R,STATUS)
2050 IF STATUS=0 THEN 2040
2060 IF R=ASC("R") THEN 160
2070 END
2080 REM READ CITY DATA
2090 FOR ROW=2 TO 1 STEP -1
2100 FOR COL=1 TO 32
2110 READ BUILDING(COL,ROW)
2120 NEXT COL
2130 NEXT ROW
2140 REM CUSTOM CHAR & COLORS
2150 CALL CHAR(136,"FFABFFABFFABFFFF")
2160 CALL CHAR(128,"003C7EFFFFFF7E42")
2180 CALL CHAR(132,"6060606060606060")
2190 CALL CHAR(133,"607858F8D8F8D8F8")
2200 CALL CHAR(134,"F8A8F8A8F8A8F8F8")
2210 CALL CHAR(135,"C3C3FFABFFABFFFF")
2220 CALL COLOR(14,7,12)
2230 CALL CHAR(122,"8040201008040201")
2240 CALL CHAR(123,"0102040810204080")
2250 CALL CHAR(124,"80E0F8FEFFFFFFFF")
2260 CALL CHAR(125,"814224180081C3E7")
2270 CALL CHAR(126,"01071F7FFFFFFFFF")
2280 RETURN
2290 REM SET UP CITY
2300 FOR ROW=2 TO 1 STEP -1
2310 FOR COL=1 TO 32
2320 BLOCK$(ROW)=BLOCK$(ROW)&CHR$(BUILDING(COL,ROW))
2330 NEXT COL
2340 NEXT ROW
2350 FOR ROW=2 TO 1 STEP -1
2360 FOR COL=1 TO 32
2370 PLACE(ROW)=ASC(SEG$(BLOCK$(ROW),COL,1))
2380 CALL HCHAR(ROW+22,COL,PLACE(ROW))
2390 NEXT COL
2400 NEXT ROW
2410 RETURN
2420 REM CITY DATA
2430 DATA 136,134,131,135,133,136,136,133
2440 DATA 135,136,136,136,133,136,136,135
2450 DATA 135,136,136,134,133,136,136,136
2460 DATA 135,132,136,32,131,135,132,135
2470 DATA 134,133,128,32,132,32,135,32
2480 DATA 32,32,134,132,132,32,133,32
2490 DATA 32,32,128,32,132,32,133,135
2500 DATA 32,132,132,32,128,32,132,32
2510 REM HORIZONTAL # PRINTER
2520 FOR I=1 TO LEN(H$)
2530 DIGIT2=ASC(SEG$(H$,I,1))
2540 CALL HCHAR(ROW,COL+I,DIGIT2)
2550 NEXT I
2560 RETURN

 

 

post-41141-0-28000000-1515111300_thumb.jpg


About ten years ago, I signed up for the OmniScan SDK, which even required signing an NDA. My idea was to create a BASIC "language" that would run after the raw recognition to remove the most obvious errors.

 

For each letter on the page, OmniScan creates a list of potential glyphs, together with their probabilities. In theory, you could select those glyphs with the highest probability that made sense according to a BASIC grammar. To distinguish variables, say A0$ from AO$, you'd need even more heuristics.

 

All in all, this seemed too time-consuming to me, so it never went anywhere.
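
For what it is worth, the core of that idea fits in a few lines. This is a sketch only: the candidate lists with probabilities are made-up data in a made-up shape, not the SDK's actual output format, and the "grammar" is reduced to a membership test on a tiny word list.

```python
from itertools import product
from math import prod


def best_valid_reading(candidates, is_valid):
    """Pick the most probable glyph combination that passes is_valid.

    candidates: one list per character position of (glyph, probability) pairs.
    Falls back to the per-position most probable glyphs if nothing is valid.
    """
    best_text, best_score = None, 0.0
    for combo in product(*candidates):
        text = "".join(glyph for glyph, _ in combo)
        score = prod(probability for _, probability in combo)
        if score > best_score and is_valid(text):
            best_text, best_score = text, score
    if best_text is None:
        best_text = "".join(max(position, key=lambda gp: gp[1])[0]
                            for position in candidates)
    return best_text


# Hypothetical recognizer output for a two-letter token after "FOR I=1".
token = [[("T", 0.9), ("I", 0.1)], [("D", 0.55), ("O", 0.45)]]
print(best_valid_reading(token, lambda text: text in {"TO", "IF", "STEP"}))
```

Taken glyph by glyph, the most likely reading here is `TD`; with the validity test, the lower-scoring `TO` wins. Trying every combination only works for short tokens, and telling `A0$` from `AO$` would still need the extra heuristics mentioned above.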


I took screenshot snips from the PDF to get the right sequence.

Then I made a new, searchable PDF with OmniPage and copied the text into a TXT file.

Then I had to manually check and correct it against the original PDF. That was some work :)

Please do not forget my comment that firing does not work (or do I not know how to play it?).

And as for the mystery with the DIGIT variable, I would like to know the reason.
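
Part of a manual check like this can be automated. Line numbers in a BASIC listing should be strictly increasing, so a misread number (like the `190` that tesseract produced between lines 380 and 400 earlier in this thread) stands out. The sketch below is one possible check; the function name and message wording are made up for illustration.

```python
import re


def find_line_number_problems(lines):
    """Return (index, message) pairs for lines whose number looks wrong."""
    problems = []
    previous = None
    for index, line in enumerate(lines):
        match = re.match(r"\s*(\d+)\s", line)
        if not match:
            if line.strip():  # ignore blank lines, flag everything else
                problems.append((index, "no line number"))
            continue
        number = int(match.group(1))
        if previous is not None and number <= previous:
            problems.append((index, f"{number} does not follow {previous}"))
        else:
            previous = number
    return problems


ocr_lines = ["380 NEXT I", "190 CALL KEY(3,Y,STATUS)", "400 IF STATUS=O THEN 390"]
for index, message in find_line_number_problems(ocr_lines):
    print(f"line {index + 1}: {message}")
```

When a number was misread as too high rather than too low, the line flagged is the one after the real culprit, so it is worth looking at both. The check also says nothing about misread variable names, which still need a human eye.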
