Some more results from our experiments on generative error correction (GEC) with LLMs:
https://alphacephei.com/nsh/2025/03/15/generative-error-correction.html
Most 8B models at 4-bit quantization are not very stable; hallucinations are present in about 25% of cases. Qwen is particularly unstable for this task.
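One simple way to make such a pipeline more stable (a sketch of a possible approach, not the filter we actually use) is to gate the LLM output by its word-level distance from the ASR hypothesis and fall back to the original when the model drifts too far. `jiwer` is a standard WER package; the 0.5 threshold is an illustrative assumption.

```python
# Hypothetical hallucination guard: keep the LLM correction only if it
# stays close to the ASR hypothesis. Threshold 0.5 is an assumption.
from jiwer import wer

def guard_correction(asr_hypothesis: str, llm_output: str,
                     max_divergence: float = 0.5) -> str:
    """Return the LLM correction, or the original if it drifts too far."""
    if not llm_output.strip():
        return asr_hypothesis
    divergence = wer(asr_hypothesis, llm_output)
    if divergence > max_divergence:
        return asr_hypothesis  # likely hallucination, keep the original
    return llm_output
```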
Gemma2 and Gemma3 are OK; we have yet to try the 27B version.
The simple prompts from the papers certainly don't work. One has to provide much more detail and describe specific issues in the prompt. We still have more work to do on the prompt.
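To illustrate what "much more detail" means in practice, here is a sketch of a more constrained GEC prompt; the specific rules are illustrative examples of the kind of guidance that helps, not our exact prompt.

```python
# Illustrative GEC prompt (an assumption about the kind of detail that
# helps, not the exact prompt from these experiments).
GEC_PROMPT = """You are correcting the output of a speech recognizer.
Rules:
- Fix only likely recognition errors (confusable words, split or merged words).
- Do not rephrase, summarize, or fix grammar the speaker plausibly used.
- Keep proper names exactly as written unless clearly misrecognized.
- Output only the corrected transcript, nothing else.

Transcript:
{hypothesis}
"""

def build_prompt(hypothesis: str) -> str:
    return GEC_PROMPT.format(hypothesis=hypothesis)
```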
Even prompt formatting matters: by modifying the prompt format we were able to reduce WER from 26% to 16%.
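For reference, corpus-level WER numbers like the ones above can be computed with the standard `jiwer` package (the file names here are placeholders):

```python
# Compute corpus-level WER between references and corrected hypotheses.
from jiwer import wer

with open("ref.txt") as f:
    references = [line.strip() for line in f]
with open("hyp.txt") as f:
    hypotheses = [line.strip() for line in f]

print(f"WER: {wer(references, hypotheses):.1%}")
```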
For now GEC doesn't look like breakthrough tech; it seems some extra sauce is needed. Simple ROVER is equally good and much more stable, as the sketch below illustrates.
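For comparison, here is a minimal ROVER-style voting sketch. It is simplified in two ways the original ROVER is not: insertions relative to the pivot hypothesis are ignored, and there is no confidence weighting.

```python
# Minimal ROVER-style combination (sketch): align each hypothesis to the
# first one with Levenshtein alignment, then take a majority vote per slot.
from collections import Counter

def align(ref: list[str], hyp: list[str]) -> list[tuple[str, str]]:
    """Levenshtein alignment as (ref_word, hyp_word) pairs,
    with '' marking insertions/deletions."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else 1):
            pairs.append((ref[i - 1], hyp[j - 1])); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            pairs.append((ref[i - 1], "")); i -= 1      # deletion in hyp
        else:
            pairs.append(("", hyp[j - 1])); j -= 1      # insertion in hyp
    return pairs[::-1]

def rover(hypotheses: list[str]) -> str:
    """Majority vote over hypotheses aligned to the first (pivot) one."""
    pivot = hypotheses[0].split()
    slots = [Counter({w: 1}) for w in pivot]  # the pivot votes for itself
    for hyp in hypotheses[1:]:
        pos = 0
        for ref_w, hyp_w in align(pivot, hyp.split()):
            if ref_w == "":          # insertion w.r.t. pivot: ignored here
                continue
            slots[pos][hyp_w] += 1   # '' is a vote for deleting the word
            pos += 1
    voted = [c.most_common(1)[0][0] for c in slots]
    return " ".join(w for w in voted if w)
```

For example, `rover(["the cat sat", "the cat set", "a cat sat"])` returns `"the cat sat"`.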
We discussed on the channel with iLa that an English prompt helps for non-English languages. I think it is possible for some models, but I can't confirm it in our experiments.
For big models, splitting the input doesn't help much.
There is still a lot of overcorrection: proper names that are rare and unknown to the LLM get rewritten, and grammar gets overcorrected too. We need to work more on this.
The difference between Gemma2-9B and Gemini Flash is not very large, except for the number of hallucinations.
Most models have very poor knowledge of rare domains and poor knowledge of speech (phonetics).