Can't process ORC files in Data Factory: ErrorCode=ParquetJavaInvocationException

Joris 6 Reputation points
2020-07-07T09:04:54.987+00:00

Hi,

Our organisation uses ORC-formatted files for our central file storage. In our data factory I am unable to process most of the ORC files: only very small files (< 1 MB) go through, and for all other ORC files the pipeline fails to run.
We need to convert the files to a type we can work with in dataflows, such as Parquet or CSV, but we are currently not able to do this for most files.
The IR we use is the AutoResolveIntegrationRuntime from Azure; we are not able to use a self-hosted IR.
This is the full error when running a pipeline with a copy data activity:

{
"errorCode": "2200",
"message": "ErrorCode=ParquetJavaInvocationException,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=An error occurred when invoking java, message: java.nio.BufferOverflowException:Unable to retrieve Java exception..,Source=Microsoft.DataTransfer.Richfile.OrcTransferPlugin,''Type=Microsoft.DataTransfer.Richfile.JniExt.JavaBridgeException,Message=,Source=Microsoft.DataTransfer.Richfile.HiveOrcBridge,'",
"failureType": "UserError",
"target": "Copy data1",
"details": []
}

Can you help us out?

Azure Data Lake Storage
Azure Data Factory

1 answer

  1. KranthiPakala-MSFT 46,437 Reputation points Microsoft Employee
    2020-07-07T22:35:50.65+00:00

    Hi @Joris-3620,

    Thanks for your query, and sorry for the experience you are having.

    The cause of this error is that the default JVM heap size is not large enough for the JVM to do the (de)serialization work when copying ORC-format data. To mitigate it, we need to increase this default JVM heap size.

    However, I can see that an Azure IR is currently used for this copy activity, and unfortunately the JVM heap size cannot be modified on an Azure IR, so we need to use a self-hosted IR instead. After the self-hosted IR is created, add the following system environment variable on the machine that hosts the self-hosted IR and then restart the IR:

    _JAVA_OPTIONS = "-Xms256m -Xmx16g" (Note: these are only sample values; determine the min/max heap sizes based on the memory available on the host machine.)
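    For reference, here is a minimal sketch of applying this on the Windows machine that hosts the self-hosted IR, assuming the IR runs as the default 'DIAHostService' Windows service (run from an elevated PowerShell session; the heap sizes are just the sample values from above):

        # Set _JAVA_OPTIONS machine-wide so the IR's embedded JVM picks it up.
        # -Xms is the initial heap size, -Xmx the maximum (sample values; tune to available RAM).
        [Environment]::SetEnvironmentVariable('_JAVA_OPTIONS', '-Xms256m -Xmx16g', 'Machine')

        # Restart the self-hosted IR service so the new variable takes effect
        # ('DIAHostService' is the default service name for the self-hosted IR).
        Restart-Service -Name 'DIAHostService'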

    I see that you have mentioned that you were not able to use a self-hosted IR. Could you please elaborate on why you weren't able to use a SHIR for your copy activity, so that I can reach out to the internal team about an alternative using the Azure IR?

    Please let me know.


    Thank you.
    Please do consider clicking "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members.