Azure Synapse Delta Table with Varchar(max)

Axel 0

Hi all,

I did save a Dataframe as Delta-format with pyspark and created a managed table in a Lake Database, to access it aswell with SQL Script and the Serverless-Endpoint

raw_stream\
    .writeStream\
    .format("delta")\
    .outputMode("append") \
    .option("checkpointLocation", bronze_path_chk_path)\
    .start(bronze_table_path)

spark.sql(f'CREATE TABLE IF NOT EXISTS bronze_layer.{entity} USING DELTA LOCATION \'{bronze_table_path}/\'')

As some columns, which are nested JSON, have a lenght of more then 10000, I receive an error when querying over the SQL Serverless

Error: Column 'JSON_column' of type 'NVARCHAR(4000)' is not compatible with external data type 'JSON string. (underlying parquet nested/repeatable column must be read as VARCHAR or CHAR)'.

Primary Question : How can I define the columntype as varchar(max) from within a notebook? So the table can be directly queried.

Secondary Question: If from within a notebook is not possible, how to set it up in a Lake Database? Serverless Database is not an option, as I need the managed Delta Tables.

One more hint, the following Query returns proper results within SQL on the Lake Database


SELECT 
    column1,
    column1,
   
      JSON_VALUE(JSON_column, '$.property1') AS property1,
      JSON_VALUE(JSON_column, '$.property2') AS property2,

FROM
    OPENROWSET(
        BULK 'https://***.dfs.core.windows.net/bronze/entity/',
        FORMAT = 'DELTA'
    )  
    WITH
    (
		column1,
		column2,
        JSON_column VARCHAR(MAX)
   
    ) query

Thanks

Smaran Thoomu 14,870 Reputation points Microsoft Vendor

2024-09-04T22:48:48.1466667+00:00
Hi @Axel
Thank you for using Microsoft Q&A platform and thanks for posting your query here.

It seems that you are trying to query a Delta table with a nested JSON column that has a length of more than 10000, and you are receiving an error when querying over the SQL Serverless. The error message indicates that the column 'JSON_column' of type 'NVARCHAR(4000)' is not compatible with external data type 'JSON string. (underlying parquet nested/repeatable column must be read as VARCHAR or CHAR)'.

To define the column type as varchar(max) from within a notebook, you can use the spark.sql() function to execute a SQL statement that alters the table schema. Here is an example:

spark.sql("ALTER TABLE bronze_layer.entity ALTER COLUMN JSON_column TYPE VARCHAR(MAX)")

This statement will alter the column type of the 'JSON_column' column in the 'entity' table to varchar(max).

If you want to set it up in a Lake Database, you can use the same ALTER TABLE statement in SQL. Here is an example:

ALTER TABLE bronze_layer.entity ALTER COLUMN JSON_column TYPE VARCHAR(MAX)

Please note that the above statement will only work if the table is not currently being used. If the table is being used, you will need to lock the table before altering the schema.

I hope this helps! Let me know if you have any further questions.
Axel 0 Reputation points

2024-09-05T06:27:03.84+00:00

double Post

1 answer

Axel 0 Reputation points

2024-09-05T06:14:09.0966667+00:00
Thank You for the response, but that does not work. For the spark.sql execution throws an error on the type "varchar(max)"

ParseException: [PARSE_SYNTAX_ERROR] Syntax error at or near 'MAX'.(line 1, pos 81) == SQL == ALTER TABLE bronze_layer.rntity ALTER COLUMN JSON_column TYPE VARCHAR(MAX)

If I use "string" instead, it works fine, but as the column is allready in that type it changes nothing. A query from SQL continues to fail.

For the second command, I only reveive "Query completed with errors.", but no further details
Please sign in to rate this answer.
Smaran Thoomu 14,870 Reputation points Microsoft Vendor

2024-09-05T14:40:51.8433333+00:00

@Axel Thanks for your response. Can I confirm that your issue is resolved?

Axel 0 Reputation points

2024-09-06T06:49:46.12+00:00

Actually no, the issue is still there. As I did describe

Smaran Thoomu 14,870 Reputation points Microsoft Vendor

2024-09-06T19:13:48.9666667+00:00

@Axel Yes, you are correct that the "VARCHAR(MAX)" syntax is not supported in Spark SQL. Instead, you can use "STRING" to define the column type as a string with unlimited length. However, as you mentioned, this may not solve the issue with querying the nested JSON column.

Regarding the second command, the "Query completed with errors" message indicates that there was an error in the SQL statement. To get more details about the error, you can try running the command with the "VERBOSE" option, like this:

ALTER TABLE bronze_layer.entity ALTER COLUMN JSON_column TYPE STRING VERBOSE;

This will provide more detailed information about the error.

If you are still having issues with querying the nested JSON column, you may need to use a different approach, such as flattening the JSON column or using a different data type. You can find more information about working with nested data in Delta tables in the Delta Lake documentation.

I hope this helps. Let me know if you have any further questions.

Smaran Thoomu 14,870 Reputation points Microsoft Vendor

2024-09-09T22:02:58.1266667+00:00

@Axel We haven’t heard from you on the last response and was just checking back to see if you have a resolution yet. In case if you have any resolution please do share that same with the community as it can be helpful to others. Otherwise, will respond with more details and we will try to help.

Axel 0 Reputation points

2024-09-10T10:35:00.5766667+00:00

I don't have a solution and I also don't have more details. The problem persists and has nothin to do with it beeing a JSON. The Problem is simply due to the length of the column.

SQL Serverless in Synapse is basically unusable, if dealing with data which has more than 8000 characters in one column

Smaran Thoomu 14,870 Reputation points Microsoft Vendor

2024-09-10T17:41:33.73+00:00

@Axel I'm sorry to hear that the previous solution did not work for you. It's possible that the issue you are experiencing is related to the maximum column size limit in SQL Serverless, which is currently set to 8000 characters. If you are working with data that has columns with more than 8000 characters, you may encounter issues when trying to load or query that data in SQL Serverless.

One possible workaround for this issue is to split the data into multiple columns or rows, so that each column or row contains less than 8000 characters. Another option is to consider using a different data storage solution that supports larger column sizes, such as Azure SQL Database or Azure Cosmos DB.

If you have any further questions or concerns, please let me know.
Sign in to comment

Use comments to ask for clarification, additional information, or improvements to the question.

Share via

Azure Synapse Delta Table with Varchar(max)

1 answer

Your answer