Get Underlined Text from Any PDF with Python | by Sasha Korovkina | May, 2024

A step-by-step guide to getting underlined text as an array from PDF files.

💡 If you want to see the code for this project, check out my repository: https://github.com/sasha-korovkina/pdfUnderlinedExtractor

PDF data extraction can be a real headache, and it gets even trickier when you’re trying to snag underlined text. Believe it or not, there aren’t any go-to solutions or libraries that handle this out of the box. But don’t worry, I’m here to show you how to tackle this.

Photo by dlxmedia.hu on Unsplash

The Theory

Extracting underlined text from PDFs can take a few different paths. You might consider using OCR to detect text elements with bottom lines, or delve into PyMuPDF’s markup capabilities. However, I’ve found that OCR tends to falter, suffering from inconsistency and low accuracy. PyMuPDF isn’t my favourite either: it demands finicky parameter tuning, which is time-consuming. Plus, one wrong setting and you could lose a bunch of data.

It is important to remember two things about PDFs:

  • Non-Structured Data: PDF elements usually lack grouping or categorization, which complicates efforts to search through the content systematically.
  • Text Formatting Recognition: Detecting specific text formats such as bold or underlined is notoriously difficult in PDFs, as most Python libraries don’t support this capability effectively.

But fear not, as we have a method to solve this.

The Strategy

  • Convert the PDF to Structured XML: Start by transforming the PDF document into a structured XML format to facilitate easier data manipulation.
  • Extract Desired Components: Identify and isolate the specific components from the XML that are relevant to our needs.
  • Use OCR (Optical Character Recognition) on the extracted coordinates to get the underlined text data as an array.
  • Extract and Output Underlined Text: Finally, extract the underlined text from the document and display or print the results.

The Code

  1. PDF to XML

We will use the pdfquery library, the most comprehensive PDF to XML converter that I’ve come across.
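As a minimal sketch of this step, the conversion with pdfquery could look like the function below. The function name `pdf_to_xml` and any file names passed to it are placeholders of mine, not taken from the original script, and the import sits inside the function so the snippet can be read without pdfquery installed.

```python
def pdf_to_xml(pdf_path, xml_path):
    """Convert a PDF into pdfquery's XML representation and save it to disk."""
    import pdfquery  # third-party dependency: pip install pdfquery

    pdf = pdfquery.PDFQuery(pdf_path)
    pdf.load()  # with no arguments, every page of the document is parsed
    # pdf.tree is an lxml element tree, so it can be written out directly
    pdf.tree.write(xml_path, pretty_print=True, encoding="utf-8")
    return xml_path
```

Calling `pdf_to_xml("example.pdf", "example.xml")` (hypothetical file names) should leave an XML file that you can open in any text editor to inspect the components described in the next step.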

  2. Studying the XML

The XML has a few key components which we’re interested in:

  • LTRect: sometimes, the library would parse the underlined text as a rectangle of minimal width beneath the text.
  • LTLine: other times, it would recognise the underline as a separate line component.

This is what your output XML will look like. Image created by author.

LTRect component example:

<LTRect y0="563.787" y1="629.964" x0="367.942" x1="473.826" width="105.884" height="66.178" bbox="[367.942, 563.787, 473.826, 629.964]" linewidth="0" pts="[[367.942, 629.964], [473.826, 629.964], [473.826, 563.787], [367.942, 563.787]]">

Therefore, by converting the whole document into XML format, we can replicate its structure as XML components. Let’s do just that!

Structure Replication

Now, we will re-create the structure of our document as bounding box coordinates. To do this, we will parse the XML to define the page, component boxes, lines and rectangles, and then draw them all on our canvas in 3 different colours.

PDF object visualization.
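Here is a rough sketch of how such a drawing step might look, assuming the canvas is a ReportLab canvas (the `can.drawString` call used later in this article suggests so). The helper names `box_to_rect` and `draw_boxes`, and the choice of just two colours (black for components, blue for underline candidates), are my own simplifications.

```python
def box_to_rect(box):
    """Turn an (x0, y0, x1, y1) bounding box into (x, y, width, height)."""
    x0, y0, x1, y1 = box
    return x0, y0, x1 - x0, y1 - y0


def draw_boxes(output_path, page_size, component_boxes, underline_boxes):
    """Draw component boxes in black and underline boxes in blue on a new PDF."""
    from reportlab.pdfgen import canvas  # third-party: pip install reportlab

    can = canvas.Canvas(output_path, pagesize=page_size)
    for boxes, rgb in ((component_boxes, (0, 0, 0)), (underline_boxes, (0, 0, 1))):
        can.setStrokeColorRGB(*rgb)
        for box in boxes:
            # PDF and ReportLab coordinates both start at the bottom-left corner,
            # so the boxes can be drawn without flipping the y axis
            can.rect(*box_to_rect(box), stroke=1, fill=0)
    can.save()
```

Because pdfquery reports boxes in PDF points with the origin at the bottom-left, they can be passed straight to the canvas as long as `page_size` matches the original page.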

Here is our initial PDF. It was generated in Microsoft Word by exporting a document with some underlines to the PDF file format:

Initial document with sample text. Image created by author.

After applying the algorithm above, here is the visual representation we get:

The box outline of the document. Black: all document components; blue: underlined text. Image created by author.

This image represents the structure of our document, where the black boxes describe all components on the page, and the blue boxes describe the LTRect elements, hence the underlined text.

Text Overlay

Now, let’s visualize all of the text within the PDF in its respective positions, with the following line of code:

can.drawString(text_x, text_y, text)

Here is the output:

PDF re-creation based on text location and underlines. Image created by author.

Note that the text is not exactly where it was in the original document, due to the difference in size and font of the mark-up language in the pdfquery library.

Coordinate Extraction

As a result of parsing the XML, we will have an array of coordinates of underlined areas. In my case, I have called it underline_text.

A piece of code which forms an array of coordinates of underlined text within the PDF file.
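A minimal sketch of this step, using only the standard library, is shown below. It assumes the XML carries `x0`, `y0`, `x1` and `y1` attributes on each element, as in the LTRect example earlier; the function name `collect_underline_boxes` and the sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Element types that may represent an underline in the converted XML
UNDERLINE_TAGS = {"LTRect", "LTLine"}


def collect_underline_boxes(xml_text):
    """Return an [x0, y0, x1, y1] list for every LTRect and LTLine element."""
    root = ET.fromstring(xml_text)
    boxes = []
    for element in root.iter():
        if element.tag in UNDERLINE_TAGS:
            boxes.append([float(element.attrib[k]) for k in ("x0", "y0", "x1", "y1")])
    return boxes


# Made-up sample: one rectangle and one text line that should be ignored
sample = """<pdfxml><LTPage>
<LTRect y0="563.787" y1="629.964" x0="367.942" x1="473.826" linewidth="0"></LTRect>
<LTTextLineHorizontal x0="72.0" y0="700.0" x1="200.0" y1="712.0">Lorem ipsum</LTTextLineHorizontal>
</LTPage></pdfxml>"""

underline_text = collect_underline_boxes(sample)
print(underline_text)  # [[367.942, 563.787, 473.826, 629.964]]
```

In a real document, rectangles are also used for table borders and shaded backgrounds, so you may want to filter the candidates further (for example by size) before treating them as underlines.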

Text Extraction

Here’s the process:

  1. We identify the coordinate rectangles as previously determined.
  2. We extract these sections from the PDF.
  3. We apply Tesseract OCR to extract text from each extracted section.
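One detail worth spelling out for the second step: PDF coordinates are measured in points (72 per inch) from the bottom-left corner, while a rendered page image is measured in pixels from the top-left corner. The sketch below shows one way to convert between the two; the function name, the `dpi` default and the `padding_pt` parameter are assumptions of mine rather than part of the original script.

```python
def pdf_box_to_pixel_box(box, page_height_pt, dpi=300, padding_pt=0.0):
    """Map a PDF box (points, origin bottom-left) to an image crop box
    (pixels, origin top-left) in (left, upper, right, lower) order."""
    x0, y0, x1, y1 = box
    scale = dpi / 72.0  # one PDF point is 1/72 of an inch
    left = (x0 - padding_pt) * scale
    right = (x1 + padding_pt) * scale
    # Flip the y axis: the top of the box is the smaller pixel row
    upper = (page_height_pt - (y1 + padding_pt)) * scale
    lower = (page_height_pt - (y0 - padding_pt)) * scale
    return tuple(int(round(v)) for v in (left, upper, right, lower))


# A 1-inch square, 1 inch from the bottom-left of a US Letter page (792 pt tall)
print(pdf_box_to_pixel_box((72, 72, 144, 144), page_height_pt=792, dpi=72))
# (72, 648, 144, 720)
```

The resulting tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects. A small positive `padding_pt` can help when the underline box sits tightly against the text.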

This method of extracting text from PDFs using coordinate rectangles and Tesseract OCR is effective for several reasons:

  1. Precision in Text Extraction: By identifying specific coordinate rectangles, the process targets only relevant areas of the PDF. This focused approach avoids unnecessary processing of the entire document and reduces errors related to extracting unwanted text.
  2. Efficiency: Extracting predefined sections directly from the PDF is much faster than processing the entire document. This method saves computational resources and time, which is particularly useful when dealing with large documents.
  3. Accuracy with OCR: Tesseract OCR is a powerful optical character recognition tool that can convert images of text into machine-readable text. By feeding it precise sections of text, it can perform more accurately as it deals with less background noise and fewer formatting issues that might confuse the OCR process in larger, unsegmented documents.

And this is the code:

Code to extract underlined text from the PDF sections.
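As a hedged sketch of what such a function could look like, the version below renders a page with pdf2image, crops each region with Pillow and runs pytesseract on the crop. The choice of pdf2image for rendering is my assumption (the repository may do this differently), and `ocr_regions` and `clean_ocr_text` are illustrative names. The boxes are expected in pixel coordinates, in (left, upper, right, lower) order, at the same DPI used for rendering.

```python
def clean_ocr_text(text):
    """Collapse the whitespace, newlines and form feeds Tesseract tends to emit."""
    return " ".join(text.split())


def ocr_regions(pdf_path, pixel_boxes, page_index=0, dpi=300):
    """Run Tesseract on each (left, upper, right, lower) pixel box of one page."""
    from pdf2image import convert_from_path  # third-party; requires Poppler
    import pytesseract  # third-party; requires the Tesseract binary

    # Render the pages as PIL images at the requested resolution
    page_image = convert_from_path(pdf_path, dpi=dpi)[page_index]
    results = []
    for box in pixel_boxes:
        text = clean_ocr_text(pytesseract.image_to_string(page_image.crop(box)))
        if text:  # skip regions where nothing was recognised
            results.append(text)
    return results
```

The function returns a plain list of strings, one per region where Tesseract recognised something.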

Make sure that you have Tesseract installed on your system before running this function. For in-depth instructions, check out their official installation guide here: https://github.com/tesseract-ocr/tessdoc/blob/main/Installation.md or in my GitHub repository here: https://github.com/sasha-korovkina/pdfUnderlinedExtractor.

Putting It All Together…

Now, if we take any PDF file, like this example file:

The whole text of the test file. Image created by author.

We have some underlined words in this file:

ipsum and laboris are underlined here. Image created by author.

After running the code described above, here is what we get:

An array of all underlined words in the document. Image created by author.

Once you have this array, you can use these words for further processing!

Enjoy using this script! I’d love to hear about any creative applications you come up with or if you’d like to contribute. Let me know! ❤️
