Boosting Protein Predictions with Synthetic Data

Scarce experimental data makes it hard to map how protein mutations affect function and to build reliable models that predict those effects. An approach called fitness translocation tackles this problem by generating synthetic variants for a protein of interest. It borrows information from evolutionarily related proteins whose variant effects have already been measured. Using protein language models, the method first converts each known variant and its wild-type sequence into numerical embeddings. It then computes how far each variant shifts the embedding relative to its own wild type.
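One natural way to write that shift, assuming it is a simple vector difference in embedding space (the symbols here are notation introduced for illustration, not taken from the original work): let $E(\cdot)$ be the protein language model's embedding function, $w_s$ the wild-type sequence of the related protein, and $v_i$ its $i$-th measured variant.

```latex
\Delta_i = E(v_i) - E(w_s)
```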

Those shift values are then applied to the embedding of the target protein's wild type, producing new "synthetic" variants in the same embedding space. Because these synthetic points are derived from measured variants of related proteins, they help fill gaps in the training set. When added to existing examples, the augmented dataset gives machine‑learning models a better chance of learning accurate fitness landscapes. Researchers expect this strategy to improve predictions for proteins that lack extensive experimental data.
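The procedure can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: random vectors stand in for protein language model embeddings, the shift is taken to be a plain vector difference, and each synthetic variant is assumed to inherit the measured fitness of the source variant it came from. The function name `translocate_fitness` and all variable names are hypothetical.

```python
import numpy as np


def translocate_fitness(source_wt, source_variants, source_fitness, target_wt):
    """Create synthetic target-variant embeddings from a related protein's data.

    source_wt:       (d,)   embedding of the related protein's wild type
    source_variants: (n, d) embeddings of its measured variants
    source_fitness:  (n,)   measured fitness of those variants
    target_wt:       (d,)   embedding of the target protein's wild type
    """
    source_wt = np.asarray(source_wt, dtype=float)
    source_variants = np.asarray(source_variants, dtype=float)
    target_wt = np.asarray(target_wt, dtype=float)

    # Shift each measured variant induces relative to its own wild type.
    shifts = source_variants - source_wt

    # Apply the same shifts to the target wild-type embedding.
    synthetic = target_wt + shifts

    # Assumption: synthetic points reuse the source variants' fitness labels.
    return synthetic, np.asarray(source_fitness, dtype=float)


# Stand-in data: random vectors in place of real language-model embeddings.
rng = np.random.default_rng(0)
dim = 8
source_wt = rng.normal(size=dim)
source_variants = source_wt + 0.1 * rng.normal(size=(5, dim))
source_fitness = rng.uniform(size=5)

target_wt = rng.normal(size=dim)
target_X = target_wt + 0.1 * rng.normal(size=(3, dim))  # few measured target variants
target_y = rng.uniform(size=3)

synthetic_X, synthetic_y = translocate_fitness(
    source_wt, source_variants, source_fitness, target_wt
)

# Augmented training set: measured target variants plus synthetic ones.
augmented_X = np.vstack([target_X, synthetic_X])
augmented_y = np.concatenate([target_y, synthetic_y])
```

The augmented arrays could then be passed to any regression model in place of the original, smaller training set. In practice the stand-in vectors would be replaced by embeddings from an actual protein language model.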
