fix: prevent duplicated cell content in TableData.grid when table_cel… #272

Open: wants to merge 1 commit into base: main

vloum commented Apr 28, 2025

Title:
fix: prevent duplicated cell content in TableData.grid when table_cells is shorter than grid

Description:

Problem

When table_cells contains fewer cells than the grid produced by TableData.grid, the previous implementation filled the remaining grid positions by reusing the last cell's content. The output tables therefore showed duplicated content, most visibly in the last row, which is misleading and does not match the original document layout.

Solution

This PR introduces a counter to ensure that only the available table_cells are assigned to the grid. Any extra cells in the grid will remain empty, preserving the correct table structure and preventing unwanted duplication.
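The diff itself is not reproduced in this conversation, so the following is only a rough sketch of the general idea, not the PR's code. It assumes a hypothetical `Cell` type with half-open row/column spans and a hypothetical `build_grid_anchor_only` helper that writes each cell's text once, at its top-left position, and leaves every other grid position empty. Note that this sketch applies the rule to every spanning cell, whereas the PR is described as limiting only the trailing cells.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell with half-open spans."""
    text: str
    row: int      # start row (inclusive)
    col: int      # start column (inclusive)
    row_end: int  # end row (exclusive)
    col_end: int  # end column (exclusive)


def build_grid_anchor_only(cells, num_rows, num_cols):
    """Place each cell's text once, at its top-left position."""
    grid = [["" for _ in range(num_cols)] for _ in range(num_rows)]
    for cell in cells:
        # Ignore cells whose anchor falls outside the declared grid.
        if cell.row < num_rows and cell.col < num_cols:
            grid[cell.row][cell.col] = cell.text
    return grid


cells = [
    Cell("A", 0, 0, 1, 1),
    Cell("B", 0, 1, 1, 2),
    Cell("C", 0, 2, 1, 3),
    Cell("Total", 1, 0, 2, 3),  # one cell spanning all three columns
]
grid = build_grid_anchor_only(cells, num_rows=2, num_cols=3)
print(grid)  # [['A', 'B', 'C'], ['Total', '', '']]
```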

Example

Before (incorrect, duplicated content):

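| Specification/Model | Name/Specification | Place of origin | Quantity | Unit price (yuan) | Total price (yuan) | Remarks |
|---|---|---|---|---|---|---|
| GH6785 | Cervical spine therapy device | Foshan, Guangdong | 10 | 4000 | 40000 |   |
| GH7698 | Lumbar spine therapy device | Foshan, Guangdong | 22 | 8000 | 160000 |   |
|   |   |   |   |   |   |   |
|   |   |   |   |   |   |   |
| Total: RMB (in words) two hundred thousand yuan only (¥210000) | Total: RMB (in words) two hundred thousand yuan only (¥210000) | ...(repeated)|

After (correct, matches original layout):

| Specification/Model | Name/Specification | Place of origin | Quantity | Unit price (yuan) | Total price (yuan) | Remarks |
|---|---|---|---|---|---|---|
| GH6785 | Cervical spine therapy device | Foshan, Guangdong | 10 | 4000 | 40000 |   |
| GH7698 | Lumbar spine therapy device | Foshan, Guangdong | 22 | 8000 | 160000 |   |
|   |   |   |   |   |   |   |
|   |   |   |   |   |   |   |
| Total: RMB (in words) two hundred thousand yuan only (¥210000) |   |   |   |   |   |   |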

Impact

  • Only affects the behavior of TableData.grid.
  • Output tables will now accurately reflect the original document, with no misleading repeated content.

Compatibility & Risk

  • This is a bugfix and should not affect normal cases where the number of cells matches the grid size.
  • No breaking changes are expected.

Reviewer Notes

  • Please check if the new logic for breaking out of the assignment loop is robust for all edge cases.
  • Consider if there is a more elegant way to handle this mapping, but the current fix addresses the immediate duplication issue.

mergify bot commented Apr 28, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
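To try the title rule locally, the pattern above can be checked with Python's `re` module. This is a sketch that assumes the `~=` operator behaves like a Python regular-expression search; the helper name `is_conventional` and the second and third sample titles are made up for illustration.

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


ok = is_conventional(
    "fix: prevent duplicated cell content in TableData.grid "
    "when table_cells is shorter than grid"
)
scoped = is_conventional("feat(serializer)!: example title")  # hypothetical title
bad = is_conventional("Fix duplicated cells")  # wrong case, no colon
print(ok, scoped, bad)  # True True False
```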

fix: prevent duplicated cell content in TableData.grid when table_cells is shorter than grid

Fixes an issue where, if the number of table_cells is less than the grid size, the last cell's content would be duplicated across extra grid cells. Now, extra cells remain empty, matching the original document layout and preventing misleading repeated content in the output tables.

Signed-off-by: vlou <919070296@qq.com>

dolfim-ibm (Contributor) commented

@vloum what you observe is actually the expected behavior. The Docling Markdown serializer intentionally repeats the content of merged columns (and rows), because the Markdown format has no syntax for merged cells.
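The repetition described here can be illustrated with a small sketch. It is not Docling's serializer: `Cell`, `expand_grid`, and `to_markdown` are hypothetical names, and the only assumption is that a merged cell is stored once with a row/column span and then copied into every grid position it covers before the pipe table is written.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell with half-open spans."""
    text: str
    row: int      # start row (inclusive)
    col: int      # start column (inclusive)
    row_end: int  # end row (exclusive)
    col_end: int  # end column (exclusive)


def expand_grid(cells, num_rows, num_cols):
    """Copy each cell's text into every grid position its span covers."""
    grid = [[""] * num_cols for _ in range(num_rows)]
    for cell in cells:
        for r in range(cell.row, min(cell.row_end, num_rows)):
            for c in range(cell.col, min(cell.col_end, num_cols)):
                grid[r][c] = cell.text
    return grid


def to_markdown(grid):
    """Render a grid as a pipe table, treating the first row as the header."""
    header, *body = grid
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


cells = [
    Cell("Item", 0, 0, 1, 1),
    Cell("Qty", 0, 1, 1, 2),
    Cell("Price", 0, 2, 1, 3),
    Cell("Total: 42", 1, 0, 2, 3),  # merged across all three columns
]
markdown = to_markdown(expand_grid(cells, num_rows=2, num_cols=3))
print(markdown)
```

Because a pipe table needs one value per column in every row, the merged "Total: 42" cell ends up in all three positions of the last row.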

If you would like properly merged columns, you can use the HTML serializer, which will render your table correctly.
It is also valid to use HTML inside Markdown; this is now possible with the new serializer interface.
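As a rough illustration of why HTML can represent the merged row directly, the sketch below emits `colspan`/`rowspan` attributes for spanning cells. It does not use Docling's HTML serializer; `Cell` and `to_html` are hypothetical, and every cell is assumed to start inside the declared number of rows.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Cell:
    """Hypothetical stand-in for a table cell with half-open spans."""
    text: str
    row: int      # start row (inclusive)
    col: int      # start column (inclusive)
    row_end: int  # end row (exclusive)
    col_end: int  # end column (exclusive)


def to_html(cells, num_rows):
    """Emit each cell once, using colspan/rowspan instead of repetition."""
    rows = [[] for _ in range(num_rows)]
    for cell in sorted(cells, key=lambda c: (c.row, c.col)):
        attrs = ""
        colspan = cell.col_end - cell.col
        rowspan = cell.row_end - cell.row
        if colspan > 1:
            attrs += f' colspan="{colspan}"'
        if rowspan > 1:
            attrs += f' rowspan="{rowspan}"'
        rows[cell.row].append(f"<td{attrs}>{escape(cell.text)}</td>")
    body = "".join("<tr>" + "".join(row) + "</tr>" for row in rows)
    return f"<table>{body}</table>"


cells = [
    Cell("Item", 0, 0, 1, 1),
    Cell("Qty", 0, 1, 1, 2),
    Cell("Price", 0, 2, 1, 3),
    Cell("Total: 42", 1, 0, 2, 3),  # merged across all three columns
]
html_table = to_html(cells, num_rows=2)
print(html_table)
```

The resulting string can be embedded in a Markdown document as raw HTML, so the merged row appears once rather than three times.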

vloum (Author) commented Apr 30, 2025

@dolfim-ibm Got it! I have no problem with repeating merged content in general, since it avoids losing information when a value is assigned to the wrong merged cell. However, I think repeating the value across the last row is unnecessary, and the extra cells could simply stay empty. When a summary row like the last one contains duplicates, it is treated as abnormal data once it is sent to an LLM. The current change only limits the last value and does not affect the original logic for cells in the middle of the table. Thank you~ I will close this PR depending on your thoughts.
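One way to get this last-row behavior without changing the grid logic would be a post-processing step on the exported rows before they are sent to an LLM. The sketch below is an assumption-laden illustration, not part of the PR: `blank_repeats_in_last_row` is a hypothetical helper that works on plain lists of strings and blanks consecutive duplicates in the final row only.

```python
def blank_repeats_in_last_row(rows):
    """Return a copy of rows with consecutive duplicates in the last row blanked."""
    if not rows:
        return []
    result = [list(row) for row in rows]  # copy so the input is not mutated
    last = result[-1]
    previous = None
    for i, text in enumerate(last):
        if text and text == previous:
            last[i] = ""     # repeated value from a merged cell
        else:
            previous = text  # new value (or empty cell) starts a new run
    return result


rows = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "21"],
    ["Total: 42", "Total: 42", "Total: 42"],
]
cleaned = blank_repeats_in_last_row(rows)
print(cleaned[-1])  # ['Total: 42', '', '']
```

A caveat of this approach is that it cannot tell a merged cell from two adjacent cells that genuinely hold the same value, so it would blank the second of those as well.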
