to the points about the models tested here. Why wasn't Claude Sonnet 3.7 (or whichever latest reasoning model was available at the time) or Mistral Large/Magistral (a European option, which you pointed out was lacking) tested? It would be very interesting to see those results if you do have them! Thanks for the interesting article and study.