The Ability to Read Text and Tables for Question Answering
The Secret is in the Architecture
When it comes to question answering applications, many options exist for parsing text, extracting information, and accessing that information later on. However, some NLP tools fail to consider whether the information is contained in tables. Tables are essential to question answering: they condense repetitive information and serve as a compact form of communication. For example, take the third row in the table below: the information it conveys is the same as the sentence “John’s salary in 2021 was $150,000”. Any text parsing tool that overlooks tables would be missing out on mission-critical data.
How can NLP tools handle both prose and tabular information?
The idea is to represent knowledge from text and tables in the same format, to seamlessly combine and utilize them. At Lymba, we accomplish this using a powerful trio: an ontology, an NLP pipeline, and a knowledge graph.
For a new domain or application, we first define target classes and relationships in the ontology, so that the labels for named entities, the semantic relations, and the domain and range restrictions on those relations are agreed upon. This is important since even experts in the same field may use different terms for the same concept, such as "student" and "pupil" in the education domain. To know when we're referring to the same object, we must label concepts consistently. Ontologies represent the schema in a standard format, making it easier for people to use and easier for machine tools to parse and validate.
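As a concrete illustration, here is a minimal sketch of what such a schema might look like using rdflib in Python. The namespace and the Person/hasSalary names are assumptions chosen for this example, not Lymba's actual ontology:

```python
# Illustrative sketch only -- the namespace and names below are
# hypothetical, not Lymba's actual ontology.
from rdflib import Graph, Namespace, OWL, RDF, RDFS, XSD

EX = Namespace("http://example.org/hr#")  # assumed namespace
g = Graph()
g.bind("ex", EX)

# One agreed-upon class: in the education domain, "student" and
# "pupil" would both map to a single concept like this.
g.add((EX.Person, RDF.type, OWL.Class))

# A relation with explicit domain and range restrictions.
g.add((EX.hasSalary, RDF.type, OWL.DatatypeProperty))
g.add((EX.hasSalary, RDFS.domain, EX.Person))
g.add((EX.hasSalary, RDFS.range, XSD.decimal))

# Serializing to Turtle keeps the schema in a standard format that
# both people and tools can inspect.
print(g.serialize(format="turtle"))
```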
The ontology drives how the NLP pipeline populates the final knowledge graph. Starting with preprocessing, the pipeline recognizes lines, white space, and blocks of text on the page using low-level image processing and OCR tools. To read tables, it recognizes table borders, the table grid, and headers on the left and at the top of the table. Just as text has underlying low-level linguistic relations – syntactic dependencies, semantic roles, etc. – table cells have grid-based relationships from cells to headers and between headers on different levels. The pipeline also forms relationships to table captions, surrounding text, and potentially other layout elements, such as section titles and footnotes. These low-level relationships are then used by our semantic calculus tool to produce the target high-level relationships.
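To illustrate the idea with the salary example from earlier – this is a simplified stand-in, not Lymba's semantic calculus – the grid-based relations for a single cell might be combined into a high-level relation like so:

```python
from dataclasses import dataclass

# Simplified stand-in for the grid-based relations described above.
@dataclass
class Cell:
    value: str        # cell contents
    row_header: str   # header on the left, e.g. "John"
    col_header: str   # header at the top, e.g. "2021"
    caption: str      # table caption, e.g. "Salary by year"

def to_high_level(cell: Cell) -> tuple:
    # Hypothetical rule: the row header supplies the subject, the
    # column header a qualifier, the caption hints at the relation
    # name, and the cell itself supplies the value.
    relation = cell.caption.split()[0].lower()
    return (cell.row_header, relation, cell.col_header, cell.value)

cell = Cell("$150,000", "John", "2021", "Salary by year")
print(to_high_level(cell))  # ('John', 'salary', '2021', '$150,000')
```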
In a perfect world, every table would be as easy to read as the one above, but complexity can arise in unexpected ways. We have seen tables with merged cells, tables without outlined borders, multi-level hierarchical headers, multiple orientations of text in cells, text overflowing into neighboring cells, and headers expressing additional information about the cells. All of these seemingly minute differences make tables difficult for a machine to understand.
Our pipeline architecture allows us to handle all kinds of table phenomena. A typical pipeline includes a set of smaller tools, one for each table element (see the sketch after this list):
A tool that extracts lines
A tool that extracts compact blobs of text
A tool that recognizes the table grid given those lines and blobs of text
A tool for extracting named entities
A tool for recognizing target high-level relations
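The sketch below shows how such a pipeline might be composed; the stage names and their toy implementations are hypothetical, but the point is that each table element has its own small, swappable tool:

```python
# Hypothetical stage functions standing in for the tools listed above.

def extract_lines(doc):
    doc["lines"] = ["<horizontal line>", "<vertical line>"]  # image-processing stand-in
    return doc

def extract_blobs(doc):
    doc["blobs"] = ["Name", "2021", "John", "$150,000"]  # OCR stand-in
    return doc

def recognize_grid(doc):
    doc["grid"] = {"rows": 2, "cols": 2}  # built from lines + blobs
    return doc

def extract_entities(doc):
    doc["entities"] = [("John", "Person")]
    return doc

def extract_relations(doc):
    doc["relations"] = [("John", "salary", "2021", "$150,000")]
    return doc

PIPELINE = [extract_lines, extract_blobs, recognize_grid,
            extract_entities, extract_relations]

def run(doc):
    # Swapping or inserting one stage leaves the others untouched.
    for stage in PIPELINE:
        doc = stage(doc)
    return doc

print(run({})["relations"])
```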
Now imagine that something unexpected arises, such as a table border drawn with symbols like ‘|’, ‘-’, ‘=’, and even ‘~’ – almost like ASCII art – as in the figure above. If we were using a mainstream deep learning architecture for question answering, we would have to collect more training data and re-train the model to handle this new kind of input. Lymba can minimize changes by editing or adding individual modules in the pipeline. Specifically, we would just add a small tool that looks for text blobs with repeating symbols, removes them from the proper text blobs, and adds them back in as lines. The rest of the pipeline, upstream and downstream, remains unchanged and produces the same knowledge graph.
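A sketch of what such an add-on module might look like – the regular expression and the blob representation are assumptions for illustration:

```python
import re

# Hypothetical add-on tool: find text blobs made entirely of repeated
# border symbols ('|', '-', '=', '~', ...) and reroute them from the
# text stream to the line stream.
BORDER = re.compile(r"^[|\-=~+\s]+$")

def split_borders(blobs):
    """Separate ASCII-art border blobs from genuine text blobs."""
    text, lines = [], []
    for blob in blobs:
        (lines if BORDER.match(blob) else text).append(blob)
    return text, lines

blobs = ["=~=~=~=~=~=~", "John", "|", "$150,000"]
text, lines = split_borders(blobs)
print(text)   # ['John', '$150,000']
print(lines)  # ['=~=~=~=~=~=~', '|']
```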
It is all about the architecture. It allows us to leverage the whole document, whether the information is unstructured or semi-structured. It allows us to handle unexpected complications by making small adjustments to a single tool instead of the entire pipeline. And the application can be used by almost anyone, whether or not the user knows how to code: in addition to extracting knowledge from documents to produce a knowledge graph, this powerful pipeline automatically converts natural language questions into SPARQL queries.
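For instance, the question “What was John’s salary in 2021?” might compile to a query along these lines, using the illustrative vocabulary from the earlier sketches (the year qualifier is folded away here for brevity, and the query shape is an assumption, not Lymba's actual output):

```python
from rdflib import Graph, Namespace, Literal, RDF, XSD

EX = Namespace("http://example.org/hr#")  # same assumed namespace as above
g = Graph()

# One fact from the running example, as the pipeline might store it.
g.add((EX.John, RDF.type, EX.Person))
g.add((EX.John, EX.hasSalary, Literal(150000, datatype=XSD.decimal)))

# A plausible SPARQL target for "What was John's salary in 2021?".
query = """
    PREFIX ex: <http://example.org/hr#>
    SELECT ?salary WHERE { ex:John ex:hasSalary ?salary }
"""
for row in g.query(query):
    print(row.salary)  # 150000
```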
To learn more, see Lymba’s research and publications.