Article

Obsidian Source: Notes / Docs - Table Parser

Summary

Pending synthesis from local Obsidian source.

Original source title: Docs Table Parser

Extracted Preview

Need to document what we did for the table parser. I'm the POC if anything works/fails. The testing team will give me edge cases where things might fail, and I need to find ways to fix them.

  • Before that, I need to create a Confluence KB page for the table parser. I'll jot down some quick notes first on how to write one.

What exactly is Confluence?

A corporate knowledge base (KB): a shared place to read documentation, work on it together, and handle team management.

Table Parser

The whole task of TableParser is to extract the balance sheet, profit & loss, and cash flow statements from annual reports. The task is broken down into two steps: parsing (getting the data out of the reports in the correct format) and cleaning (further processing to get the data into the desired format). Let's look at them step by step:

Libraries used: pymupdf, ultralyticsplus (for YOLO table extraction), img2table, pillow

pip install pymupdf ultralyticsplus img2table pillow

Parsing

The statements in the reports are messy and inconsistent, and extracting data from them with standard PDF parsers proved futile. The approach we took is to create a new page with the data from the reports and read from that page instead. In our tests, this approach worked best by far, and we were able to extract the statements accurately.
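
The notes do not spell out how the new page is built, so the following is only a rough sketch of one possible reading, not the actual TableParser code. It assumes the words are pulled with PyMuPDF's page.get_text("words"), redrawn at their original positions on a blank page, and then grouped into rows by vertical position. The function names rebuild_page and group_words_into_rows, the fixed font size, and the y_tolerance value are all hypothetical choices made for illustration.

```python
def group_words_into_rows(words, y_tolerance=3.0):
    """Group (x0, y0, x1, y1, text) word tuples into rows, top to bottom.

    Hypothetical helper: words whose top edge (y0) lies within
    y_tolerance of the first word in the current row share that row.
    """
    rows = []  # each entry: (reference_y, [(x0, text), ...])
    for x0, y0, x1, y1, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append((y0, [(x0, text)]))
    # Order the cells of every row left to right and keep only the text.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def rebuild_page(src_path, page_number, out_path):
    """Copy the words of one page onto a blank page, then read them as rows.

    Assumption: this is one way to "create a new page"; the notes do not
    confirm it. Fonts and sizes are not preserved (fixed fontsize).
    """
    import pymupdf  # PyMuPDF; older releases expose the same API as `fitz`

    src = pymupdf.open(src_path)
    page = src[page_number]
    # Each entry is (x0, y0, x1, y1, word, block_no, line_no, word_no).
    words = [w[:5] for w in page.get_text("words")]

    out = pymupdf.open()  # new, empty PDF
    new_page = out.new_page(width=page.rect.width, height=page.rect.height)
    for x0, y0, x1, y1, text in words:
        # insert_text expects the baseline start point; (x0, y1) approximates it.
        new_page.insert_text((x0, y1), text, fontsize=8)
    out.save(out_path)
    return group_words_into_rows(words)
```

Calling rebuild_page("report.pdf", 0, "clean.pdf") (both file names are placeholders) would write the simplified page and return its rows as lists of strings. The row grouping is kept separate from the PDF handling so it can be checked against the edge cases from the testing team without needing a PDF on hand.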

Approach

Integration Notes

  • Source folder: /home/yashs/Documents/Docs/Obsidian/Research-Notes
  • Local source: /home/yashs/Documents/Docs/Obsidian/Research-Notes/Notes/Docs - Table Parser.md
  • Raw copy: raw/obsidian/research-notes/Notes/Docs - Table Parser.md

Links Created Or Updated

Open Questions