Completely removing a git commit due to sensitive information


I recently made a big mistake and checked some sensitive information into a public GitHub repository.

The first reaction is to remove that sensitive information. While that is a good first step, making a new commit to remove the information means that when anyone looks at the commit history, the sensitive information will still be visible.

Re-writing history

The next step is to re-write the git history.
This can be done a few ways, but I took a simple approach and squashed the commits so that the addition and removal of the sensitive information cancel each other out and the new commit doesn’t contain any sensitive information.

To demonstrate, here is the history of a git repository where some sensitive information was committed in bbd80c4 and then removed in a120437.

$> git log --oneline
a120437 Ooops... remove sensitive data!
bbd80c4 Add some more data
0115d7b Add some data
20133f4 Initial commit

$> git show bbd80c4
diff --git a/data b/data
index f8327c3..afbbccd 100644
--- a/data
+++ b/data
@@ -1 +1,3 @@
 KEY=value
+MORE=data
+PASSWORD=secret

$> git show a120437
diff --git a/data b/data
index afbbccd..1f0aa11 100644
--- a/data
+++ b/data
@@ -1,3 +1,2 @@
 KEY=value
 MORE=data
-PASSWORD=secret

From here, I used git’s interactive rebase feature to modify the relevant commits. In this case, I chose the commit just before the sensitive data was added, like this: git rebase -i 0115d7b

In the interactive editor, I did the following and then provided a new commit message:

pick bbd80c4 Add some more data
squash a120437 Ooops... remove sensitive data!

This resulted in the last 2 commits being merged into a new one.
Here is the history:

$> git log --oneline
fd04bfc Add some more data
0115d7b Add some data
20133f4 Initial commit

$> git show fd04bfc
diff --git a/data b/data
index f8327c3..1f0aa11 100644
--- a/data
+++ b/data
@@ -1 +1,2 @@
 KEY=value
+MORE=data
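
The same result can probably be reached without the interactive editor. The following is a minimal sketch, not the method I used above: it builds a throwaway repository shaped like the example (the temporary directory, the demo identity and the `base` and `leaks` variables are all assumptions for illustration), then uses `git reset --soft` to collapse everything after the last clean commit into one new commit.

```shell
#!/bin/sh
set -eu

# Build a throwaway repository that mirrors the example history.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

echo "initial" > README
git add README && git commit -q -m "Initial commit"
printf 'KEY=value\n' > data
git add data && git commit -q -m "Add some data"
base=$(git rev-parse HEAD)                 # plays the role of 0115d7b
printf 'KEY=value\nMORE=data\nPASSWORD=secret\n' > data
git commit -q -am "Add some more data"
printf 'KEY=value\nMORE=data\n' > data
git commit -q -am "Ooops... remove sensitive data!"

# Move the branch back to the last clean commit while keeping the index
# and working tree as they are, then record the difference as one commit.
git reset -q --soft "$base"
git commit -q -m "Add some more data"

git log --oneline

# Count diff lines reachable from HEAD that still mention the secret.
leaks=$(git log -p | grep -c 'PASSWORD' || true)
echo "lines mentioning PASSWORD: $leaks"
```

This only works this simply because the bad commits are the most recent ones on the branch. If clean commits sit on top of them, the interactive rebase is the better fit.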

Rewriting commits that have already been pushed to a remote has a couple of consequences.
For one, the new history is no longer a continuation of what the remote has, so a normal push will be rejected and you will need the --force option to overwrite the existing history.
Also, anyone who has an existing clone of the repository will have issues when they pull down the latest changes. In the case of sensitive information, this is a necessary side-effect.
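
To illustrate the push behaviour, here is a small sketch that uses a local bare repository as a stand-in for GitHub. The paths, the `main` branch name and the `plain`, `remote_head` and `local_head` variables are assumptions made for the demo. It uses `--force-with-lease`, a more cautious variant of `--force` that only overwrites the remote branch if it is still where your clone last saw it.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
git init -q --bare "$work/remote.git"      # stands in for the GitHub repository
git init -q "$work/local"
cd "$work/local"
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"
git remote add origin "$work/remote.git"

# Publish a history that contains the secret.
printf 'KEY=value\n' > data
git add data && git commit -q -m "Add some data"
printf 'KEY=value\nMORE=data\nPASSWORD=secret\n' > data
git commit -q -am "Add some more data"
git push -q origin HEAD:refs/heads/main

# Rewrite the last commit locally so it no longer contains the secret.
printf 'KEY=value\nMORE=data\n' > data
git commit -q -a --amend -m "Add some more data"

# A plain push is refused because the rewritten commit does not descend
# from the commit the remote branch currently points at.
if git push -q origin HEAD:refs/heads/main 2>/dev/null; then
    plain=accepted
else
    plain=rejected
fi
echo "plain push: $plain"

# Overwrite the remote branch with the rewritten history.
git push -q --force-with-lease origin HEAD:refs/heads/main

remote_head=$(git --git-dir="$work/remote.git" rev-parse refs/heads/main)
local_head=$(git rev-parse HEAD)
echo "remote: $remote_head"
echo "local:  $local_head"
```

After the forced push the remote branch points at the rewritten commit, but the old commit object has not necessarily been deleted from the remote.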

Cleaning up the cached commits

We’re not done yet!
Git keeps track of all changes made to a repository. Even though the history no longer shows the bad commits, they are still there! You can view all changes to the repository using git reflog. From this command you can find the SHA of the bad commit and then use git show to see the sensitive information.

This means that GitHub also still has the bad commits, and if you know the SHA you will be able to find that sensitive information again. To fix this we should clear the local cache and GitHub’s cache.

You can clear your local reflog and prune the unreachable commits by issuing these commands:

rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
git gc --aggressive --prune=now
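
To check that the cleanup actually worked, you can ask git whether the old commit object still exists. This is a rough sketch in a throwaway repository; the temporary directory, the demo identity and the `bad`, `before` and `after` variables are assumptions for illustration. It rewrites a commit with `git commit --amend`, then compares `git cat-file -e` before and after expiring the reflog and pruning.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # placeholder identity for the demo
git config user.name "Demo"

printf 'KEY=value\n' > data
git add data && git commit -q -m "Add some data"
printf 'KEY=value\nMORE=data\nPASSWORD=secret\n' > data
git commit -q -am "Add some more data"
bad=$(git rev-parse HEAD)                  # the commit that holds the secret

# Rewrite the commit so the branch no longer contains the secret.
printf 'KEY=value\nMORE=data\n' > data
git commit -q -a --amend -m "Add some more data"

# The old commit is off the branch, but the reflog still keeps it alive.
if git cat-file -e "$bad" 2>/dev/null; then before=present; else before=gone; fi

# Drop every reflog entry, then delete objects nothing refers to any more.
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$bad" 2>/dev/null; then after=present; else after=gone; fi
echo "before cleanup: $before, after cleanup: $after"
```

In your own repository you would replace `$bad` with the SHA found through git reflog. Note that anything else still pointing at the commit, such as another branch, a tag or a stash, will keep it from being pruned.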

GitHub doesn’t give us a way to clear a repository’s cache, but due to the nature of git, simply deleting the repository and pushing a new copy of your local repository to GitHub will effectively destroy that cache.

GitHub also have an article on how to remove sensitive data.