Still, if one wanted to show policy-compliant CoTs directly to users while avoiding putting strong supervision on them, one could use a separate model, such as a CoT summarizer or sanitizer, to accomplish that.
But this might not work for very long. A model with an unrestricted CoT could realize, by reading prompts and answers on the internet or in its training material and comparing them with what it produces, that its CoT is being sanitized. It could then learn to lie, sooner and then better, in order to still pursue other misaligned goals, exactly like humans in psychologically unsafe environments.