Although the outputs from both models are identical, the evaluation pipelines may have small differences in configuration or version that influence how the metrics are calculated. For instance, threshold values or per-metric weights might differ slightly across the two deployments, leading to different results even for the same answers.
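One way to rule this out is to export the effective evaluation settings from each deployment and diff them. The sketch below assumes you can dump each configuration to a JSON file; the file names and the recursive diff helper are illustrative, not part of any specific SDK.

```python
import json

def load_config(path: str) -> dict:
    """Load an evaluation config exported from one deployment."""
    with open(path) as f:
        return json.load(f)

def diff_configs(a: dict, b: dict, prefix: str = "") -> list[str]:
    """Recursively report keys whose values differ between two configs."""
    diffs = []
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        if key not in a or key not in b:
            diffs.append(f"{path}: present in only one config")
        elif isinstance(a[key], dict) and isinstance(b[key], dict):
            diffs.extend(diff_configs(a[key], b[key], prefix=path + "."))
        elif a[key] != b[key]:
            diffs.append(f"{path}: {a[key]!r} != {b[key]!r}")
    return diffs

if __name__ == "__main__":
    # Hypothetical exports of the evaluation settings from each deployment.
    config_a = load_config("eval_config_deployment_a.json")
    config_b = load_config("eval_config_deployment_b.json")
    for line in diff_configs(config_a, config_b):
        print(line)
```

Any line this prints (a mismatched threshold, weight, or metric version) is a candidate explanation for the score gap.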
Some evaluation metrics, especially NLP-based ones, can introduce a small degree of randomness into their computation. For example, metrics like coherence and fluency are often scored by a model, so they can show inherent run-to-run variance depending on how the scoring algorithm is implemented.
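You can check whether a metric is stable by scoring the same answer several times and looking at the spread. In this minimal sketch the scorer is a hypothetical stand-in that simulates run-to-run jitter; swap in the real metric call from your evaluation pipeline.

```python
import random
import statistics

def score_coherence(answer: str) -> float:
    """Stand-in for a model-based coherence scorer (hypothetical).

    Real scorers that rely on an LLM judge or sampled decoding can return
    slightly different values for identical input; the jitter below only
    simulates that behaviour for illustration.
    """
    base = 4.0  # pretend the "true" coherence score is 4 on a 1-5 scale
    return base + random.uniform(-0.25, 0.25)

def measure_variance(answer: str, runs: int = 10) -> tuple[float, float]:
    """Score the same answer repeatedly and summarise the spread."""
    scores = [score_coherence(answer) for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)

if __name__ == "__main__":
    mean, std = measure_variance("The generated answer to evaluate.")
    print(f"coherence over 10 runs: mean={mean:.3f}, std={std:.3f}")
```

If the standard deviation with the real scorer is comparable to the gap you see between the two deployments, metric variance alone may explain it.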
Even though the generated answers are identical, hidden context, metadata, or other inputs may be passed into the evaluation pipeline. For example, fine-tuning metadata or deployment-specific attributes could still affect the evaluation metrics.
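To surface such hidden inputs, capture the full record each deployment sends to the evaluator and compare every field, not just the answer. The sketch below assumes each record was saved as a JSON file with an "answer" field; the file names and field layout are assumptions for illustration.

```python
import json

def find_hidden_differences(payload_a: dict, payload_b: dict) -> set[str]:
    """Return top-level fields that differ even though the answers match."""
    assert payload_a.get("answer") == payload_b.get("answer"), "answers differ"
    keys = set(payload_a) | set(payload_b)
    return {k for k in keys if payload_a.get(k) != payload_b.get(k)}

if __name__ == "__main__":
    # Hypothetical evaluation records captured from each deployment.
    with open("eval_input_deployment_a.json") as f:
        payload_a = json.load(f)
    with open("eval_input_deployment_b.json") as f:
        payload_b = json.load(f)
    for field in sorted(find_hidden_differences(payload_a, payload_b)):
        print(f"field differs between deployments: {field}")
```

Fields flagged here (e.g. context, system prompt, or deployment metadata) are worth checking, since the evaluator may weigh them even when the answers themselves match.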