Serialization exception after changing from reading the attached ADLS account to reading the same ADLS account through a linked service

Jose Gonzalez Gongora 25 Reputation points Microsoft Employee
2024-11-08T22:49:30.5733333+00:00

Currently we are running a Synapse notebook that reads from the "Primary ADLS Gen2 account" and parses the data to JSON. This job has been running stably for a while now. We need to run the same logic but read from a different ADLS account instead, so to achieve this we use a linked service with the following Spark configuration:

spark.conf.set(f"spark.storage.synapse.$baseUrl%s.linkedServiceName","lnkAdls_keybased")
spark.conf.set(f"fs.azure.account.auth.type.$baseUrl%s", "SAS")
spark.conf.set(f"fs.azure.sas.token.provider.type.$baseUrl%s", "com.microsoft.azure.synapse.tokenlibrary.LinkedServiceBasedSASProvider")

But after changing where the job reads the data from, it started to fail with the following stack trace:

Caused by: java.io.NotSerializableException: com.microsoft.vegas.vfs.SecureVegasFileSystem
Serialization stack:
    - object not serializable (class: com.microsoft.vegas.vfs.SecureVegasFileSystem, value: 3.3.09)
    - field (class: $iw, name: fileSys, type: class org.apache.hadoop.fs.FileSystem)
    - object (class $iw, $iw@3ccef33c)
    - field (class: $iw, name: $iw, type: class $iw)
    - object (class $iw, $iw@310c8203)
    ...

After searching around, it looks like this issue is related to the use of the Hive metastore by Spark SQL, but that's only my best guess, and I don't know what to do with this information. Is there a way to disable the use of Hive at the notebook level? Or is there something else I can check in order to figure out a solution or workaround?

Azure Synapse Analytics

Accepted answer
  1. Vinodh247 23,581 Reputation points MVP
    2024-11-10T11:13:27.4466667+00:00

    The java.io.NotSerializableException you're encountering is due to the SecureVegasFileSystem object not being serializable. The issue arises when Spark attempts to serialize objects that aren't designed for serialization, typically when they are captured in task closures or broadcast variables. Your stack trace shows a field fileSys of type org.apache.hadoop.fs.FileSystem inside a $iw REPL wrapper, which suggests a file system handle declared in a notebook cell is being pulled into a closure that Spark tries to serialize.

    Possible Solutions:

    Avoid Serialization of Non-Serializable Objects:

    • Ensure that objects like SecureVegasFileSystem aren't inadvertently serialized. In particular, avoid referencing such objects inside closures that Spark serializes and ships to the executors (see the sketch below).
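
    For illustration, here is a minimal Scala sketch of the pattern the stack trace points at (a cell-level fileSys handle captured by a closure) and one way to avoid it. The names baseUrl, containerName and pathsRdd are assumptions, not taken from the original post:

        import java.net.URI
        import scala.collection.JavaConverters._
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.{FileSystem, Path}

        // Hypothetical root of the linked ADLS account (baseUrl and containerName
        // are assumed to be defined elsewhere in the notebook).
        val abfssRoot = s"abfss://$containerName@$baseUrl/"

        // Anti-pattern (sketch): a FileSystem created in a notebook cell and then used
        // inside a transformation. The REPL's $iw wrapper drags the non-serializable
        // handle into the task closure, producing NotSerializableException.
        // val fileSys = FileSystem.get(new URI(abfssRoot), spark.sparkContext.hadoopConfiguration)
        // pathsRdd.filter(p => fileSys.exists(new Path(p)))

        // Possible fix: ship only serializable values (strings) and open the FileSystem
        // on the executors, once per partition. Adjust the filter to whichever
        // settings your job actually needs.
        val fsConfEntries: Map[String, String] =
          spark.sparkContext.hadoopConfiguration.iterator().asScala
            .map(e => e.getKey -> e.getValue)
            .filter { case (k, _) => k.startsWith("fs.azure") }
            .toMap

        val existingPaths = pathsRdd.mapPartitions { paths =>
          val conf = new Configuration()
          fsConfEntries.foreach { case (k, v) => conf.set(k, v) }
          val fs = FileSystem.get(new URI(abfssRoot), conf)
          paths.filter(p => fs.exists(new Path(p)))
        }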

    Use Serializable Wrappers:

    • If you must use non-serializable objects, consider encapsulating them in a serializable wrapper that carries only serializable state (such as URIs and configuration values) and re-creates the object lazily on the executors (see the sketch below).
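
    A minimal sketch of such a wrapper, assuming a Scala notebook; the class name and fields are illustrative only. Only plain strings are serialized, and the file system handle is re-created lazily wherever it is first used:

        import java.net.URI
        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.FileSystem

        // Serializable wrapper: the FileSystem is @transient, so it is never shipped;
        // each executor rebuilds it on first access from the serializable fields.
        class FileSystemHolder(rootUri: String, confEntries: Map[String, String]) extends Serializable {
          @transient lazy val fs: FileSystem = {
            val conf = new Configuration()
            confEntries.foreach { case (k, v) => conf.set(k, v) }
            FileSystem.get(new URI(rootUri), conf)
          }
        }

    Tasks that receive a FileSystemHolder only carry the strings across the wire; calling holder.fs inside a closure rebuilds the handle locally instead of serializing a SecureVegasFileSystem instance.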

    Configure Spark to Use Hive Metastore Properly:

    • Misconfigurations related to the Hive metastore can lead to serialization problems. Review your Spark and Hive configurations to ensure they're correctly set up; the snippet below shows how to inspect the current values.
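
    To see which catalog implementation and metastore-related settings the current session is actually using, you can read the values from the notebook, for example:

        // Read-only checks of the session's catalog/metastore configuration.
        println(spark.conf.get("spark.sql.catalogImplementation"))   // "hive" or "in-memory"
        println(spark.conf.get("spark.sql.warehouse.dir"))
        spark.catalog.listDatabases().show(truncate = false)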

    Disable Hive Support in Spark:

    • If Hive support isn't essential for your operations, you can disable it by setting spark.sql.catalogImplementation to in-memory. This change can prevent Spark from attempting to serialize Hive-related objects. Note that this is a static configuration, so it must be in place before the Spark session starts (see the sketch below).
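
    Because spark.sql.catalogImplementation is a static configuration, setting it with spark.conf.set in a running notebook is rejected. One option (a sketch, assuming the Synapse session-configuration magic is available in your workspace) is to put it in the first cell of the notebook:

        %%configure -f
        {
            "conf": {
                "spark.sql.catalogImplementation": "in-memory"
            }
        }

    Alternatively, the same property can typically be applied at the Spark pool level through an Apache Spark configuration.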

    Next Steps:

    Review Your Code:

    • Identify and refactor any code segments where non-serializable objects might be serialized.

    Adjust Spark Configurations:

    • Modify your Spark configurations to prevent unnecessary serialization of non-serializable objects.

    Consult the Spark documentation on closures and task serialization for further guidance.

    By implementing these strategies, you should be able to resolve the serialization exception and achieve stable operation when reading from the linked ADLS account.


0 additional answers
