This paper introduces Null-Space Projection (NSP), a method to perform targeted edits on large language models while preserving emergent behaviors. NSP identifies the null space of the edit gradient and projects parameter updates onto this subspace, ensuring modifications remain orthogonal to the representations underlying emergent capabilities. Theoretical guarantees and empirical results on GPT-2, LLaMA-2, and Mistral models demonstrate NSP's efficacy in preserving emergent reasoning abilities.
Key findings
Null-Space Projection (NSP) preserves emergent behaviors in LLMs during targeted editing.
NSP projects parameter updates onto the null space of the edit gradient, minimizing interference with pre-existing competencies.
Empirical results show NSP achieves edit success rates comparable to those of MEND and MEMIT while preserving emergent reasoning abilities.
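The projection step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the protected directions are supplied as rows of a gradient matrix `G`, and the function name `null_space_projection` and the SVD-based rank cutoff are choices made here for clarity.

```python
import numpy as np

def null_space_projection(G, update, rtol=1e-10):
    """Project `update` onto the null space of the gradient matrix G.

    G: (k, d) matrix whose rows span the directions the edit must avoid.
    update: (d,) candidate parameter update.
    Returns the component of `update` orthogonal to the row space of G,
    so that G @ result ~ 0 (the edit does not move along protected directions).
    """
    # The right singular vectors with non-negligible singular values
    # form an orthonormal basis of the row space of G.
    _, s, Vt = np.linalg.svd(G, full_matrices=False)
    rank = int(np.sum(s > rtol * s.max()))
    V_r = Vt[:rank]  # (rank, d) orthonormal basis of the row space
    # Remove the row-space component: u_null = u - V_r^T (V_r u)
    return update - V_r.T @ (V_r @ update)
```

Because the result lies in the null space of `G`, applying `G` to the projected update yields (numerically) zero, which is the orthogonality property the summary attributes to NSP.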
Limitations & open questions
The impact of NSP on emergent abilities beyond reasoning remains to be explored.