+Larry Posted January 16, 2021

I have a new project -- downloading Current Notes, and at archive.org one of the download options is "PDF with Text." Sometimes the size difference is not too much. Other times, it is huge. One example is a PDF of 120+ MB and the PDF with Text was about 6 MB. As far as I can tell, they have the same content, and the "with Text" does contain graphics. I know that PDF files can be gigantic and plain text is much, much smaller, but what is the "magic" that allows the "with Text" to be so much smaller? (One of my pet peeves is that folks frequently overdo the DPI when doing OCR of mainly black-and-white, mostly text magazines.) Can someone explain this?
Mathy Posted January 16, 2021

Hello Larry

My guess would be that you can use the search text feature to find something in a PDF "with text" and that a plain PDF would just look the same but in reality just be a picture.

Sincerely

Mathy
+DrVenkman Posted January 16, 2021

I have no affiliation with or connection to the files you're talking about specifically, but in general, remember that "PDF" is merely the de facto standard "Portable Document Format." Any particular PDF file can be created and prepared in a number of different ways and still be read by the same software.

A PDF with text can be smaller because each page is essentially just a format descriptor + text, relying on the reader's software to format and display each page. The much larger versions of the same files are instead a collection of much larger graphics (usually scans or software-generated TIF images), typically one or more per page, depending on the formatting of the original and how complicated the layout is.

From my personal, professional experience (I typically read, create, edit and/or review many dozens or hundreds of PDFs per day as part of my job), the embedded-text files are convenient if you need to copy/paste text, but they do rely on the nuances and quirks of the user's PDF reader/renderer to format each page properly. Sometimes there may be weird font/rendering problems, depending on who created the original and whether the same fonts are available to the reader. Sometimes the PDF itself has embedded fonts to work around that problem, most times not, at least in modern times given the ubiquity of Unicode fonts.

That said, a large "image-only" PDF will render and view properly on almost every device and operating system out there.
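To put rough numbers on the gap between page images and page text, here is a minimal arithmetic sketch. The US Letter page size, the DPI and bit-depth values, and the 4,000-characters-per-page figure are all assumptions chosen for illustration, not measurements of the Current Notes scans. The sizes are for raw, uncompressed pixels; real PDFs compress their images, so only the ratios are meaningful.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size: US Letter, 8.5 x 11 inches.
PAGE = (8.5, 11.0)

color_600 = raw_scan_bytes(*PAGE, dpi=600, bits_per_pixel=24)  # 24-bit color
color_300 = raw_scan_bytes(*PAGE, dpi=300, bits_per_pixel=24)
bw_300 = raw_scan_bytes(*PAGE, dpi=300, bits_per_pixel=1)      # 1-bit black and white

# Assumed amount of text on a dense magazine page, one byte per character.
text_bytes = 4000

for label, size in [
    ("600 DPI color scan", color_600),
    ("300 DPI color scan", color_300),
    ("300 DPI 1-bit scan", bw_300),
    ("plain text", text_bytes),
]:
    print(f"{label:>20}: {size:>12,} bytes")
```

Under these assumptions, doubling the DPI quadruples the raw pixel data, and dropping from 24-bit color to 1-bit black and white cuts it by a factor of 24. That is why scan resolution and color depth tend to dominate the file size of an image-based PDF far more than any text it carries.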
bfollowell Posted January 16, 2021

Mathy said: "My guess would be that you can use the search text feature to find something in a PDF "with text" and that a plain PDF would just look the same but in reality just be a picture."

This. Many PDFs are just images bundled in a PDF wrapper, while many have been OCRed and have real searchable text embedded along with the images.
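One rough way to tell the two kinds of file apart without a PDF library is to scan the raw bytes for image and font objects. The sketch below is a crude heuristic under stated assumptions, not a real parser: `summarize_pdf_bytes` is a hypothetical helper name, and it only sees dictionaries stored uncompressed in the file. Newer PDFs can pack font dictionaries into compressed object streams, so a font count of zero is a hint, not proof, that the file has no text layer.

```python
import re

# Image XObjects are declared with "/Subtype /Image" (whitespace optional).
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")
# Font dictionaries are declared with "/Type /Font"; the \b keeps
# "/Type /FontDescriptor" from being counted as a font.
FONT_RE = re.compile(rb"/Type\s*/Font\b")


def summarize_pdf_bytes(data: bytes) -> dict:
    """Count visible image and font objects in raw PDF bytes (crude heuristic)."""
    images = len(IMAGE_RE.findall(data))
    fonts = len(FONT_RE.findall(data))
    return {
        "images": images,
        "fonts": fonts,
        # Images but no visible fonts suggests scans in a PDF wrapper.
        "looks_image_only": images > 0 and fonts == 0,
    }


# Tiny synthetic fragments standing in for real files (assumed for the demo).
scan_only = b"<< /Type /XObject /Subtype /Image /Width 2550 /Height 3300 >>"
with_text = scan_only + b"\n<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"

print(summarize_pdf_bytes(scan_only))
print(summarize_pdf_bytes(with_text))
```

To try it on a download, read the file in binary mode and pass the bytes in, for example `summarize_pdf_bytes(open("issue.pdf", "rb").read())`, where `issue.pdf` is a placeholder filename. Simply trying a text search in a PDF viewer, as Mathy suggests, remains the more reliable check.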
+Allan Posted January 16, 2021

I scanned many of these PDFs. The original file is a PDF with TIFF picture scans inside. Then archive.org takes the PDF and makes a number of other file types from it. I do not do any text conversion when I do the scans.
Atari_Ace Posted January 16, 2021

archive.org has some kind of heuristic it uses to determine if the text layer in a pdf is intact and comprehensive, and if it isn't, it will reprocess the pdf file (assuming pdf is the original upload source) and generate the "with text" version. In the past it was using ABBYY FineReader 11 for the reprocessing, which is sometimes able to massively shrink the input pdf, but they seem to have dropped that recently (they're using Tesseract now for OCR). I've also wondered what that heuristic is, since it sometimes triggers on pdfs that have significant amounts of text in the file.

I suspect this "with text" feature was added as a replacement for the djvu files they used to include. I miss the djvu files, as they tended to be smaller and render much faster than the pdfs, almost always at the same level of quality.
KG7PFS Posted January 16, 2021

Based on my experience with magazines downloaded from the archive, I'd say PDF looks like a scanned magazine, while PDF with text looks like a blurred blob. It's worth the extra size.