undefined symbol: cublasLtGetStatusString at Endpoint Deployment Error

nikviz 16 Reputation points
2022-11-23T17:47:13.07+00:00

Hi

I am getting this error while deploying an endpoint. The deployment had worked fine for three months; after I rebuilt the environment for a minor change, this error started appearing.

File "/azureml-envs/tensorflow-2.7/lib/python3.8/site-packages/azureml_inference_server_http/server/user_script.py", line 81, in load_script
main_module_spec.loader.exec_module(user_module)
File "<frozen importlib._bootstrap_external>", line 843, in exec_module
File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
File "/var/azureml-app/221123171914-1667710269/score.py", line 6, in <module>
import ktrain
File "/azureml-envs/tensorflow-2.7/lib/python3.8/site-packages/torch/__init__.py", line 191, in <module>
_load_global_deps()
File "/azureml-envs/tensorflow-2.7/lib/python3.8/site-packages/torch/__init__.py", line 153, in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
File "/azureml-envs/tensorflow-2.7/lib/python3.8/ctypes/__init__.py", line 373, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /azureml-envs/tensorflow-2.7/lib/python3.8/site-packages/torch/lib/../../nvidia/cublas/lib/libcublas.so.11: undefined symbol: cublasLtGetStatusString, version libcublasLt.so.11
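
The traceback shows `libcublas.so.11` being loaded from `site-packages/nvidia/cublas/lib`, while the environment also puts other directories on `LD_LIBRARY_PATH`. A minimal diagnostic sketch, to be run inside the built image, might list the relevant package versions and every `libcublasLt.so*` visible on that path. The helper names `installed_versions` and `find_cublaslt` are hypothetical, and the distribution name `nvidia-cublas-cu11` is an assumption inferred from the path in the error message.

```python
import os
from importlib import metadata
from pathlib import Path

# Distributions worth inspecting. "nvidia-cublas-cu11" is an assumed name,
# inferred from the nvidia/cublas path in the error message.
PACKAGES = ("torch", "torchvision", "nvidia-cublas-cu11")


def installed_versions(names=PACKAGES):
    """Return {name: version or None} for each distribution name."""
    result = {}
    for name in names:
        try:
            result[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result


def find_cublaslt(search_path=None):
    """List libcublasLt.so* files found in each LD_LIBRARY_PATH directory."""
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue
        directory = Path(entry)
        if directory.is_dir():
            hits.extend(sorted(str(p) for p in directory.glob("libcublasLt.so*")))
    return hits


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
    for path in find_cublaslt():
        print("found:", path)
```

If this reports a torch version that was never listed in the pip install command, torch was probably pulled in as a dependency of another package, and its version can change between rebuilds.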

This is my environment:

FROM mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.2-cudnn8-ubuntu20.04:20220714.v1

ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/tensorflow-2.7

# Create conda environment

RUN conda create -p $AZUREML_CONDA_ENVIRONMENT_PATH \
python=3.8 pip=20.2.4

# Prepend path to AzureML conda environment

ENV PATH $AZUREML_CONDA_ENVIRONMENT_PATH/bin:$PATH

# Install pip dependencies

RUN HOROVOD_WITH_TENSORFLOW=1 pip install 'matplotlib~=3.5.0' \
'psutil~=5.8.0' \
'tqdm~=4.62.0' \
'scipy~=1.7.0' \
'numpy~=1.21.0' \
'ipykernel~=6.0' \
# upper bound azure-core to address typing-extensions conflict
'azure-core<1.23.0' \
'azureml-core==1.43.0' \
'azureml-defaults==1.43.0' \
'azureml-mlflow==1.43.0.post1' \
'azureml-telemetry==1.43.0' \
'azureml-inference-server-http==0.7.2' \
'pandas==1.4.1' \
'ktrain==0.30.0' \
'sentence-transformers==2.1.0' \
'tensorflow==2.7.0' \
'tokenizers==0.10.3' \
'protobuf~=3.19.1' \
'Flask==2.1.0' \
'transformers==4.10.3'

# This is needed for mpi to locate libpython

ENV LD_LIBRARY_PATH $AZUREML_CONDA_ENVIRONMENT_PATH/lib:$LD_LIBRARY_PATH


1 answer

  1. nikviz 16 Reputation points
    2022-11-24T17:08:14.207+00:00

    Hi @romungi-MSFT

    I managed to solve it.

    I kept the container image as 20220729.v1 and added 'torch==1.12.0' and 'torchvision==0.13.0' to the pip install list. I didn't need the higher version given in the link.
    Thanks for that link.
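
Since the fix relies on torch and torchvision staying at the pinned versions, a small check at the end of the image build can catch a rebuild that resolves something else. This is only a sketch: `check_pins` and `EXPECTED` are hypothetical names, and the pins are the two taken from this answer.

```python
from importlib import metadata

# Pins taken from the answer above.
EXPECTED = {"torch": "1.12.0", "torchvision": "0.13.0"}


def check_pins(expected=EXPECTED, lookup=metadata.version):
    """Return a list of mismatch messages; an empty list means all pins hold."""
    problems = []
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (wanted {wanted})")
            continue
        # Ignore a local version suffix such as "+cu113" when comparing.
        if found.split("+", 1)[0] != wanted:
            problems.append(f"{name}: found {found}, wanted {wanted}")
    return problems
```

Calling `check_pins()` inside the environment and failing the build when the list is non-empty would surface a version drift at build time rather than at endpoint deployment.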

    Final env file:

    FROM mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.2-cudnn8-ubuntu20.04:20220729.v1

    ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/tensorflow-2.7

    # Create conda environment

    RUN conda create -p $AZUREML_CONDA_ENVIRONMENT_PATH \
    python=3.8 pip=20.2.4

    # Prepend path to AzureML conda environment

    ENV PATH $AZUREML_CONDA_ENVIRONMENT_PATH/bin:$PATH

    # Install pip dependencies

    RUN HOROVOD_WITH_TENSORFLOW=1 pip install 'matplotlib~=3.5.0' \
    'psutil~=5.8.0' \
    'tqdm~=4.62.0' \
    'scipy~=1.7.0' \
    'numpy~=1.21.0' \
    'ipykernel~=6.0' \
    # upper bound azure-core to address typing-extensions conflict
    'azure-core<1.23.0' \
    'azureml-core==1.43.0' \
    'azureml-defaults==1.43.0' \
    'azureml-mlflow==1.43.0.post1' \
    'azureml-telemetry==1.43.0' \
    'azureml-inference-server-http==0.7.2' \
    'pandas==1.4.1' \
    'ktrain==0.30.0' \
    'sentence-transformers==2.1.0' \
    'tensorflow==2.7.0' \
    'tokenizers==0.10.3' \
    'protobuf~=3.19.1' \
    'Flask==2.1.0' \
    'transformers==4.10.3' \
    'torch==1.12.0' \
    'torchvision==0.13.0'

    # This is needed for mpi to locate libpython

    ENV LD_LIBRARY_PATH $AZUREML_CONDA_ENVIRONMENT_PATH/lib:$LD_LIBRARY_PATH
