Is there a method to create a pandas DataFrame from document analysis results where a table was extracted?

Brian Beam 25 Reputation points
2024-07-26T17:27:57.87+00:00

I am using the Azure Form Recognizer (layout) prebuilt model to extract tables from PDFs.

The result can be converted into a Python dict that has a section for tables, each of which contains cells. The actual table contents are split into individual cell content values, each with a column and row index.
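To make the shape described above concrete, here is a small hand-written sample, not real service output. The key names (`tables`, `cells`, `row_index`, `column_index`, `content`) are assumed from the working code posted later in this thread and may differ between SDK versions.

```python
# Hand-written sample that mirrors the described shape; not real service output.
sample_result = {
    "tables": [
        {
            "cells": [
                {"row_index": 0, "column_index": 0, "content": "Name"},
                {"row_index": 0, "column_index": 1, "content": "Age"},
                {"row_index": 1, "column_index": 0, "content": "Alice"},
                {"row_index": 1, "column_index": 1, "content": "24"},
            ]
        }
    ]
}

# Each cell carries its own position, so the table is a flat list of cells
# rather than a list of rows.
positions = []
for cell in sample_result["tables"][0]["cells"]:
    positions.append((cell["row_index"], cell["column_index"]))
    print(cell["row_index"], cell["column_index"], cell["content"])
```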

I assume there is an easy-to-use method to move that extracted table into a pandas DataFrame, but I could not find any documentation on it.

I can iterate through the dict, grab each cell, and use its row and column indexes to reconstruct the table, but that is not ideal.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

Accepted answer
  1. Amanulla Asfak Ahamed 155 Reputation points
    2024-07-26T19:02:40.47+00:00

    Yes, you can convert the table extracted by Azure Form Recognizer into a pandas DataFrame using Python. Here's a basic approach:

    1. Extract the data: start from the output dictionary from Form Recognizer, which includes the row and column indices for each cell.

    2. Initialize an empty DataFrame: create an empty DataFrame that you'll populate with the data.

    3. Populate the DataFrame: iterate through the dictionary, using the row and column indices to place each cell's content into the correct location in the DataFrame.

    import pandas as pd

    # Assuming 'data' is the dictionary containing the extracted table information
    data = {
        'tables': [
            {
                'rows': [
                    {'cells': [{'row_index': 0, 'col_index': 0, 'text': 'Name'},
                               {'row_index': 0, 'col_index': 1, 'text': 'Age'}]},
                    {'cells': [{'row_index': 1, 'col_index': 0, 'text': 'Alice'},
                               {'row_index': 1, 'col_index': 1, 'text': '24'}]}
                ]
            }
        ]
    }

    # Create an empty DataFrame
    df = pd.DataFrame()

    # Populate the DataFrame
    for table in data['tables']:
        for row in table['rows']:
            for cell in row['cells']:
                row_index = cell['row_index']
                col_index = cell['col_index']
                text = cell['text']
                df.at[row_index, col_index] = text

    # Adjust column names if necessary
    df.columns = df.iloc[0]  # Set the first row as the column header
    df = df[1:]  # Remove the first row as it's now the header

    print(df)
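As a possible variation on the cell-by-cell approach above, the sketch below fills a plain Python grid first and builds the DataFrame once at the end, which avoids growing the DataFrame one cell at a time. The helper name `table_to_dataframe` is hypothetical, and the keys `row_count`, `column_count`, `cells`, `row_index`, `column_index`, and `content` are assumptions about the layout result dictionary; check them against your own output.

```python
import pandas as pd


def table_to_dataframe(table: dict, header: bool = True) -> pd.DataFrame:
    """Build a DataFrame from one table dict (hypothetical helper, assumed keys)."""
    # Pre-size the grid so positions with no returned cell stay as None.
    grid = [[None] * table["column_count"] for _ in range(table["row_count"])]
    for cell in table["cells"]:
        grid[cell["row_index"]][cell["column_index"]] = cell["content"]
    if header and grid:
        # Treat the first row as column names, the rest as data.
        return pd.DataFrame(grid[1:], columns=grid[0])
    return pd.DataFrame(grid)


# Hand-written sample table; not real service output.
sample_table = {
    "row_count": 2,
    "column_count": 2,
    "cells": [
        {"row_index": 0, "column_index": 0, "content": "Name"},
        {"row_index": 0, "column_index": 1, "content": "Age"},
        {"row_index": 1, "column_index": 0, "content": "Alice"},
        {"row_index": 1, "column_index": 1, "content": "24"},
    ],
}

grid_df = table_to_dataframe(sample_table)
print(grid_df)
```

Passing `header=False` keeps the first row as data, which may be safer when a table has no header row or when header cells span several columns.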


1 additional answer

  1. Brian Beam 25 Reputation points
    2024-07-26T20:29:28.89+00:00

    Thanks for the response @Amanulla Asfak Ahamed. It looks like there is no direct Azure Form Recognizer method for this conversion, so we have to construct the table cell by cell.

    I made some adjustments to your recommendation and am posting the code that worked. It handles multiple tables by creating one DataFrame per table, and it uses the correct result dictionary keys for the current version of the Form Recognizer (layout) model.

    This works:

    import pandas as pd

    # Convert Azure Form Recognizer results to a dictionary
    # ('result' is the analysis result returned by the layout model)
    result_dict = result.to_dict()

    # Dictionary to hold all of the DataFrames, one per table
    dfs_dict = {}
    table_number = 1
    for table in result_dict['tables']:
        df_temp = pd.DataFrame()

        # Place each cell's content at its row/column position
        for cell in table['cells']:
            row_index = cell['row_index']
            col_index = cell['column_index']
            content = cell['content']
            df_temp.at[row_index, col_index] = content

        # Use the first row as the header, then drop it from the data
        df_temp.columns = df_temp.iloc[0]
        df_temp = df_temp[1:]

        # Store this table's DataFrame (df_temp, not an outer df)
        dfs_dict['table_{}'.format(table_number)] = df_temp
        table_number += 1

    # Inspect the 1st table
    print(dfs_dict['table_1'])
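Another option that may be worth trying is to load the cell list into a long-format DataFrame and let `DataFrame.pivot` arrange it by row and column index. This is only a sketch on a hand-written sample: it assumes each cell dict has `row_index`, `column_index`, and `content` keys as in the code above, and that no two cells share the same position (`pivot` raises an error on duplicates).

```python
import pandas as pd

# Hand-written sample of one table's 'cells' list; not real service output.
sample_cells = [
    {"row_index": 0, "column_index": 0, "content": "Name"},
    {"row_index": 0, "column_index": 1, "content": "Age"},
    {"row_index": 1, "column_index": 0, "content": "Alice"},
    {"row_index": 1, "column_index": 1, "content": "24"},
]

# One row per cell, then reshape so row_index/column_index become the axes.
cells_df = pd.DataFrame(sample_cells)
wide = cells_df.pivot(index="row_index", columns="column_index", values="content")

# Promote the first row to column names and drop it from the data.
wide.columns = wide.iloc[0].tolist()
wide = wide.iloc[1:].reset_index(drop=True)

print(wide)
```

Positions for which no cell was returned come out as NaN after the pivot, so it can help to inspect the result before promoting the first row to a header.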
    