
What is a PDF with text?


Larry


I have a new project -- downloading Current Notes, and at archive.org one of the download options is "PDF with Text."  Sometimes the size difference is not too much.  Other times, it is huge.  One example is a PDF of 120+ MB and the PDF with Text was about 6 MB.  As far as I can tell, they have the same content, and the "with Text" does contain graphics.  I know that PDF files can be gigantic and plain text is much, much smaller, but what is the "magic" that allows the "with Text" to be so much smaller?  (One of my pet peeves is that folks frequently overdo the DPI when doing OCR of mainly Black and White, mostly text magazines.)

 

Can someone explain this?

 

 


I have no affiliation with or connection to the files you're talking about specifically, but in general, remember that "PDF" is merely the de facto standard "Portable Document Format." Any particular PDF file can be created and prepared in a number of different ways and still be read by the same software. A PDF with text can be smaller because each page is essentially just a format descriptor + text, relying on the reader's software to format and display each page. The much larger versions of the same files are instead a collection of much larger graphics (usually scans or software-generated TIF images), typically one or more per page, depending on the formatting of the original and how complicated the layout is.
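
To put rough numbers on the gap between page images and plain text, here is a back-of-the-envelope sketch in Python. The page size (US Letter), the DPI values, and the figure of about 3,000 characters per page are all assumptions chosen for illustration, and the image sizes are uncompressed, so real PDFs will be smaller than this once the images are compressed.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed figure: a dense magazine page holds roughly 3,000 characters,
# which is about 3,000 bytes as plain single-byte text.
TEXT_BYTES_PER_PAGE = 3000

if __name__ == "__main__":
    scans = [
        ("300 DPI, 24-bit color", 300, 24),
        ("600 DPI, 24-bit color", 600, 24),
        ("300 DPI, 1-bit black and white", 300, 1),
    ]
    for label, dpi, bpp in scans:
        size = raw_page_bytes(8.5, 11, dpi, bpp)
        ratio = size / TEXT_BYTES_PER_PAGE
        print(f"{label}: {size:,} bytes raw, about {ratio:,.0f}x the text")
```

Doubling the DPI quadruples the pixel count, which is why an overly high scan resolution on a mostly black-and-white magazine inflates the file so quickly.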

 

This comes from my personal, professional experience (I typically read, create, edit and/or review many dozens or hundreds of PDFs per day as part of my job). The embedded-text files are convenient if you need to copy/paste text, but they do rely on the nuances and quirks of the user's PDF reader/renderer to format each page properly. Sometimes there may be weird font/rendering problems, depending on who created the original and whether the same fonts are available to the reader. Sometimes the PDF itself has embedded fonts to work around that problem, but most times not, at least in modern times with the ubiquity of Unicode fonts. That said, a large "image-only" PDF will render and view properly on almost every device and operating system out there.


2 hours ago, Mathy said:

 

My guess would be that you can use the search text feature to find something in a PDF "with text" and that a plain PDF would just look the same but in reality just be a picture.

 

This.

 

Many PDFs are just images bundled in a PDF wrapper, while many have been OCRed and have real searchable text embedded along with the images.
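
If you want a quick way to guess which kind of PDF you have, here is a crude sketch in Python. It just counts font and image markers in the raw bytes of the file. The function name `count_markers` and the sample data are made up for illustration, and this is only a heuristic: many PDFs keep these dictionaries inside compressed object streams, so a count of zero proves nothing. A real PDF library that tries to extract text from each page would be more reliable.

```python
import re

# PDF name tokens that usually indicate a font resource or an embedded image.
# The \b keeps "/Type /Font" from also matching "/Type /FontDescriptor".
FONT_RE = re.compile(rb"/Type\s*/Font\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def count_markers(pdf_bytes):
    """Count font and image markers visible in the uncompressed parts of a PDF."""
    return {
        "fonts": len(FONT_RE.findall(pdf_bytes)),
        "images": len(IMAGE_RE.findall(pdf_bytes)),
    }


def count_markers_in_file(path):
    """Same check, reading the bytes from a file on disk."""
    with open(path, "rb") as handle:
        return count_markers(handle.read())


if __name__ == "__main__":
    # Tiny hand-made fragments, not complete PDF files.
    image_only = b"<< /Type /XObject /Subtype /Image /Width 2550 >>"
    with_text = image_only + b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>"
    print(count_markers(image_only))
    print(count_markers(with_text))
```

Images with no fonts suggests a plain scan; images plus fonts suggests a scan with an OCR text layer on top.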

 


archive.org has some kind of heuristic it uses to determine if the text layer in a pdf is intact and comprehensive, and if it isn't, it will reprocess the pdf file (assuming pdf is the original upload source) and generate the "with text" version.  In the past it was using ABBYY FineReader 11 for the reprocessing, which is sometimes able to massively shrink the input pdf, but they seem to have dropped that recently (they're using Tesseract now for OCR).

 

I've also wondered about what that heuristic is, since it sometimes triggers on pdfs that have significant amounts of text in the file.

 

I suspect this "with text" feature was added as a replacement for the djvu files they used to include.  I miss the djvu files, as they tended to be smaller and render much faster than the pdfs, almost always at the same level of quality.

 

 

