## Optimized Table Tokenization for Table Structure

Maksym Lysak and Peter Staar

IBM Research

{mly,ahn,nli,cau,taa}@zurich.ibm.com

Abstract.

Keywords:

## 1

Tables are ubiquitous in documents such as scientific papers, patents, reports,

In modern document understanding systems [1,15], table extraction is typi-

Fig. 1.

today,

Recently

While the majority of research in TSR is currently focused on the develop-

The main contribution of this paper is the introduction of a new optimised ta-

The paper is structured as follows. In section 2, we give an overview of the

## 2

Approaches to formalize the logical structure and layout of tables in electronic

Other work [20] aims at predicting a grid for each table and deciding which cells

Within the Im2Seq approaches have shown to be well-suited for the TSR task and allow a

## 3

All known Im2Seq based models for TSR fundamentally work in similar ways.

ulary and can be interpreted as a table structure. For example, with the HTML

Fig. 2.

Obviously, HTML and other general-purpose markup languages were not de-

Additionally, it would be desirable if the representation would easily allow

In a valid HTML table, the token sequence must describe a 2D grid of table

generation. Implicitly, this also means that Im2Seq models need to learn these

In practice, we observe two major issues with prediction quality when train-

## 4

To mitigate the issues with HTML in Im2Seq-based TSR models laid out before,

## 4.1

In Figure 3, we illustrate how the OTSL is defined. In essence, the OTSL defines

The OTSL vocabulary is comprised of the following tokens:

-
-
-
-
-
-
-
-
-
-

A notable attribute of OTSL is that it has the capability of achieving lossless

Fig. 3.

## 4.2

The OTSL representation follows these syntax rules:

- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- The left neighbour of an "X" cell must be either another "X" cell or a "U"

The application of these rules gives OTSL a set of unique properties.
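The left-neighbour condition on "X" cells stated above can be checked mechanically. The following is a minimal sketch, assuming the table structure is held as a list of rows of token strings. The function name `x_left_neighbour_violations` and the tokens "C" and "L" used in the sample grids are hypothetical placeholders chosen for illustration; only "X" and "U" are taken from the rule itself, and the sketch covers only that one condition.

```python
from typing import List, Tuple


def x_left_neighbour_violations(grid: List[List[str]]) -> List[Tuple[int, int]]:
    """Return (row, col) positions of "X" cells that break the left-neighbour rule.

    An "X" cell is accepted only if the cell directly to its left is
    another "X" cell or a "U" cell. An "X" in the first column has no
    left neighbour and is therefore reported as a violation.
    """
    violations = []
    for r, row in enumerate(grid):
        for c, token in enumerate(row):
            if token != "X":
                continue  # the rule constrains "X" cells only
            left = row[c - 1] if c > 0 else None
            if left not in ("X", "U"):
                violations.append((r, c))
    return violations


# Sample grids; "C" and "L" are assumed placeholder tokens.
valid_grid = [
    ["C", "L", "C"],
    ["U", "X", "C"],
]
invalid_grid = [
    ["C", "C", "C"],
    ["C", "X", "C"],
]

print(x_left_neighbour_violations(valid_grid))
print(x_left_neighbour_violations(invalid_grid))
```

Because the check only ever looks at the cell to the left, it can also be run on a partially generated grid, row by row.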
First

These characteristics can be easily learned by sequence generator networks,

reduces significantly the column drift seen in the HTML based models (see Fig-

## 4.3

The design of OTSL allows to validate a table structure easily on an unfinished

## 5

To evaluate the impact of OTSL on prediction accuracy and inference times, we

Fig. 4.

We rely on standard metrics such as Tree Edit Distance score (TEDs) for

order to compute the TED score. Inference timing results for all experiments

## 5.1

We have chosen the PubTabNet data set to perform HPO, since it includes a

Table

## 5.2

We picked the model parameter configuration that produced the best prediction

Additionally,

Table 2.

## 5.3

To illustrate

Fig. 5.

Fig. 6.

## 6

We demonstrated that representing tables in HTML for the task of table struc-

First and foremost, given the same network configuration, inference time for

Secondly, OTSL has more inherent structure and a significantly restricted vo-

## References

- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10.
- 11.
- 12.
- 13.
- 14.
- 15.
- 16.
- 17.
- 18.
- 19.
- 20.
- 21.
- 22.
- 23.