The bioinformatics arm of the Babraham Institute produces quite a few software packages, including some like FastQC that have achieved nearly ubiquitous adoption in the academic community. Other offerings from Babraham include Bismark for methylation profiling and Sherman for simulating NGS reads from bisulfite-treated DNA.
In my computational genome science class, there was a student that wanted to measure the accuracy of methylation profiling workflows, and he identified Bismark and Sherman as his tools of choice. He would identify a sequence of interest, use Sherman to simulate reads from that sequence, run the methylation call procedure, and then assess the accuracy of the methylation calls.
There was a slight problem, though: Sherman is random in its conversion of Cs to Ts, and the number of conversions can only be controlled by things like error rate and conversion rate. By default, Sherman provides no way to “protect” a C from conversion the way a methyl group does on actual genomic DNA. So there was no way to assess the accuracy of the methylation profiling workflow since we had no way of indicating/knowing which Cs should be called methylated!
After a bit of chatting, however, we came up with a workaround. In our genomic DNA fasta file, any Cs we want to protect from conversion (i.e. methylate them in silico) we simply convert to X. Then we run Sherman, which will convert Cs to Ts at the specified conversion rate but will leave Xs alone. Then, after simulating the reads but before the methylation analysis procedure, we simply change Xs back to Cs. This seemed to pass some sanity tests for us, and we contacted the author of Sherman, Felix Krueger (@FelixNmnKrueger), who confirmed that he saw no potential pitfalls with the approach.
I like this little hack, and assuming I ever use it myself in the future, I will probably create a script that does the C -> X conversion from a GFF3 file or a list of “methylation sites” in some other format (the conversion from X back to C is trivial).