This paper introduces Null-Space Projection (NSP), a method to perform targeted edits on large language models while preserving emergent behaviors. NSP identifies the null space of the edit gradient and projects parameter updates onto this subspace, ensuring modifications remain orthogonal to the representations underlying emergent capabilities. Theoretical guarantees and empirical results on GPT-2, LLaMA-2, and Mistral models demonstrate NSP's efficacy in preserving emergent reasoning abilities.
Key findings
Null-Space Projection (NSP) preserves emergent behaviors in LLMs during targeted editing.
NSP projects parameter updates onto the null space of the edit gradient, minimizing interference with pre-existing competencies.
Empirical results show NSP achieves edit success rates comparable to those of MEND and MEMIT while preserving emergent reasoning abilities.
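The projection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the protected directions are supplied as rows of a gradient matrix `G`, and the function name `null_space_projection` and the SVD-based rank cutoff are choices made here for clarity.

```python
import numpy as np

def null_space_projection(G, update, rtol=1e-10):
    """Project `update` onto the null space of the gradient matrix G.

    G: (k, d) matrix whose rows span the directions the edit must avoid.
    update: (d,) candidate parameter update.
    Returns the component of `update` orthogonal to the row space of G,
    so that G @ result ~ 0 (the edit does not move along protected directions).
    """
    # The right singular vectors with non-negligible singular values
    # form an orthonormal basis of the row space of G.
    _, s, Vt = np.linalg.svd(G, full_matrices=False)
    rank = int(np.sum(s > rtol * s.max()))
    V_r = Vt[:rank]  # (rank, d) orthonormal basis of the row space
    # Remove the row-space component: u_null = u - V_r^T (V_r u)
    return update - V_r.T @ (V_r @ update)
```

Because the result lies in the null space of `G`, applying `G` to the projected update yields (numerically) zero, which is the orthogonality property the summary attributes to NSP.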
Limitations & open questions
The impact of NSP on emergent abilities beyond reasoning remains to be explored.