Diff files: what are hunks and how to extract them?

2014-02-06

First, I will introduce the patchfile format, then show how to split a patch into multiple patch files, either by filename or by hunk.

I’m going to use the term patchfile for the output of the diff(1) command; such files usually carry a .diff suffix, e.g. mypatch.diff.

What is a patchfile?

When comparing 2 files, the diff(1) command tries to record differences as groups of differing lines, and uses common lines to anchor these groups within the files. Such groups are called hunks of difference.

Example of a patchfile with 3 hunks (each one starts with a @@ line):

$ diff -u group.orig group
--- group.orig	2014-02-04 19:38:20.800277081 +0100
+++ group	2014-02-04 19:38:33.366452009 +0100
@@ -1,5 +1,4 @@
 root:x:0:root
-bin:x:1:root,bin,daemon
 daemon:x:2:root,bin,daemon
 sys:x:3:root,bin,user1
 adm:x:4:root,daemon
@@ -7,8 +6,6 @@
 disk:x:6:root
 lp:x:7:daemon,user1,user2
 mem:x:8:
-kmem:x:9:
-wheel:x:10:root,user1
 ftp:x:11:
 mail:x:12:
 uucp:x:14:
@@ -17,8 +14,6 @@
 locate:x:21:
 rfkill:x:24:
 smmsp:x:25:
-http:x:33:
-games:x:50:user1,user2
 lock:x:54:
 uuidd:x:68:
 network:x:90:user1,user2

You may have noticed an extra two-line header:

--- group.orig	2014-02-04 19:38:20.800277081 +0100
+++ group	2014-02-04 19:38:33.366452009 +0100

Because a patchfile can contain differences for several files, each set of hunks starts with a similar two-line header that names the source file and the modified file the following hunks relate to. The timestamps are the modification times of each file.

The patchfile above is in unified format (diff -u option). It is bigger than the default normal format below, which carries no context lines, but the context lines it adds help patch(1) locate each hunk and apply the patchfile correctly.

$ diff group.orig group
2d1
< bin:x:1:root,bin,daemon
10,11d8
< kmem:x:9:
< wheel:x:10:root,user1
20,21d16
< http:x:33:
< games:x:50:user1,user2

The meld GUI tool helps to clearly outline the 3 hunks:

[Screenshot: the 3 hunks highlighted in Meld]

Description of the hunk header, with the 3rd hunk:

@@ -17,8 +14,6 @@

-17 : from file (/tmp/group.orig), hunk context starts from the 17th line
,8 : the hunk is 8 lines long
+14 : to file (/tmp/group), hunk context starts from the 14th line
,6 : the hunk is 6 lines long
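
The four fields can also be pulled out programmatically. Here is a minimal Python sketch, assuming a unified-format header; the function name parse_hunk_header is mine, it does not come from diff(1) or patch(1):

```python
import re

# Unified hunk header: @@ -start[,length] +start[,length] @@ [optional text]
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (from_start, from_len, to_start, to_len) for a hunk header."""
    match = HUNK_RE.match(line)
    if match is None:
        raise ValueError("not a unified hunk header: %r" % line)
    from_start, from_len, to_start, to_len = match.groups()
    # When the ",length" part is omitted, the length is 1.
    return (
        int(from_start),
        1 if from_len is None else int(from_len),
        int(to_start),
        1 if to_len is None else int(to_len),
    )
```

Called on the header of the 3rd hunk, parse_hunk_header("@@ -17,8 +14,6 @@") returns (17, 8, 14, 6).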

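The context becomes obvious: 3 lines around the differences, which is the default context for diff -u. In this hunk, the 8 lines of the from file are 6 context lines plus the 2 removed lines, and the 6 lines of the to file are those same 6 context lines.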
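
The amount of context also decides where one hunk ends and the next one begins: two changes that are close enough end up in the same hunk. The sketch below illustrates this with Python’s difflib module rather than diff(1) itself (difflib also defaults to 3 context lines); the 20-line input and the count_hunks helper are made up for the illustration:

```python
import difflib


def count_hunks(old_lines, new_lines, context=3):
    """Count the hunks difflib produces for a given number of context lines."""
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile="old", tofile="new",
        n=context, lineterm="",
    )
    # Body lines start with ' ', '-' or '+', so only hunk headers start with '@@'.
    return sum(1 for line in diff if line.startswith("@@"))


# Hypothetical 20-line file with two deletions far apart.
old = ["line %d" % i for i in range(1, 21)]
new = [line for line in old if line not in ("line 2", "line 15")]

hunks_default = count_hunks(old, new)           # 3 lines of context
hunks_wide = count_hunks(old, new, context=10)  # 10 lines of context
```

With 3 lines of context the two deletions land in 2 separate hunks; with 10 lines their contexts overlap and difflib merges them into a single hunk.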

Extracting hunks

The patch command does not let you select which hunks to apply; it applies all of them as a whole. Given the above example, it’s not possible to apply only the third hunk or only the first one.

However it’s sometimes handy to apply:

Only the hunks related to a specific set of files
Only the hunks related to cosmetic changes
For a specific file, only the last hunk

I still haven’t found a way to do that directly with patch(1) and diff(1), but I’ve found a trick: splitting the hunks of a patchfile out into separate files, one per hunk or one per patched file.

Splitpatch is a tool to automate this process.
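
To make the idea concrete, here is a rough Python sketch of a per-hunk split. It is not splitpatch’s implementation, only an illustration under a few assumptions: the input is a unified diff, every file section starts with a "---" / "+++" header pair, and the name split_hunks is mine. Each returned piece repeats the file header so that it is a standalone patch:

```python
import re

# Only the two lengths are needed here, to know how many body lines follow.
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def split_hunks(patch_text):
    """Return a list of standalone patches, one per hunk of a unified diff."""
    lines = patch_text.splitlines(keepends=True)
    patches = []
    header = []
    i = 0
    while i < len(lines):
        line = lines[i]
        # A file header is a '--- ' line immediately followed by a '+++ ' line.
        if (line.startswith("--- ") and i + 1 < len(lines)
                and lines[i + 1].startswith("+++ ")):
            header = lines[i:i + 2]
            i += 2
            continue
        match = HUNK_RE.match(line)
        if match and header:
            old_left = 1 if match.group(1) is None else int(match.group(1))
            new_left = 1 if match.group(2) is None else int(match.group(2))
            body = [line]
            i += 1
            # Consume body lines until both sides of the hunk are complete.
            while i < len(lines) and (old_left > 0 or new_left > 0):
                tag = lines[i][:1]
                if tag in (" ", "\n"):   # context line (possibly stripped)
                    old_left -= 1
                    new_left -= 1
                elif tag == "-":         # line only in the from file
                    old_left -= 1
                elif tag == "+":         # line only in the to file
                    new_left -= 1
                elif tag != "\\":        # not a "\ No newline" marker
                    break
                body.append(lines[i])
                i += 1
            # Keep a trailing "\ No newline at end of file" marker, if any.
            if i < len(lines) and lines[i].startswith("\\"):
                body.append(lines[i])
                i += 1
            patches.append("".join(header + body))
            continue
        i += 1  # anything else (commit message, index lines, ...) is skipped
    return patches
```

Counting body lines from the lengths in the hunk header, instead of looking for the next "---" line, avoids mistaking a removed line that itself starts with "-- " for a new file header. The hunk headers are copied unchanged, so the line numbers of a hunk extracted this way still refer to the original file.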

Splitpatch

Let’s take a real-world example from the CFEngine repository.

First grab splitpatch, and read its documentation:

$ wget https://raw2.github.com/benjsc/splitpatch/master/splitpatch.rb -O splitpatch
$ chmod +x ./splitpatch
$ ./splitpatch --help

splitpatch splits a patch that is supposed to patch multiple files
into a set of patches.
Currently splits unified diff patches.
If the --hunk option is given, a new file is created for each hunk.
If the --fullname option is given, new files are named using the
full path of the patch to avoid duplicates in large projects.

Fetch the patchfile we are going to use, and get some information about it with diffstat(1):

$ wget https://github.com/cfengine/core/commit/6a2972ab804e903051987564e5c9a4182bcc5c6f.patch -O original.diff
$ diffstat original.diff
libpromises/evalfunction.c                         |  101 ++++------
libutils/string_lib.c                              |  7
tests/acceptance/01_vars/02_functions/readstringarrayidx.cf | 59 +++++
tests/acceptance/01_vars/02_functions/readstringarrayidx.cf.txt |    4
4 files changed, 116 insertions(+), 55 deletions(-)

The original.diff patchfile affects 4 files, most likely with many hunks, but it’s hard to be more precise without looking at the file itself.
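
diffstat(1) counts lines, not hunks. If you want a quick per-file hunk count before splitting anything, a few lines of Python are enough. This is only a sketch, assuming a unified diff; hunks_per_file is a hypothetical helper, not an existing tool:

```python
import re

HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def hunks_per_file(patch_text):
    """Map each 'to' file name of a unified diff to its number of hunks."""
    counts = {}
    current = None
    old_left = new_left = 0
    for line in patch_text.splitlines():
        if old_left > 0 or new_left > 0:
            # Inside a hunk body: count lines down, never parse headers here.
            if line.startswith("-"):
                old_left -= 1
            elif line.startswith("+"):
                new_left -= 1
            elif not line.startswith("\\"):  # context line
                old_left -= 1
                new_left -= 1
            continue
        if line.startswith("+++ "):
            # Drop the tab-separated timestamp that diff -u appends.
            current = line[4:].split("\t")[0]
            counts.setdefault(current, 0)
            continue
        match = HUNK_RE.match(line)
        if match and current is not None:
            counts[current] += 1
            old_left = 1 if match.group(1) is None else int(match.group(1))
            new_left = 1 if match.group(2) is None else int(match.group(2))
    return counts
```

Run on the text of a patchfile, it returns a dictionary such as {"b/x.c": 2, "b/y.txt": 1} for a patch with two hunks in x.c and one in a new file y.txt (made-up names, shown with the b/ prefix that Git adds).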

Split original.diff into a set of patchfiles, one per modified file:

$ ./splitpatch  original.diff
File null.patch already exists. Renaming patch.
loic@iron[0]: ~/tmp/patch
$ ls -1
evalfunction.c.patch
null.patch
null.patch.0
original.diff
splitpatch
string_lib.c.patch

Check a few of the patchfiles to verify that each one relates to a single file:

$ diffstat evalfunction.c.patch
 libpromises/evalfunction.c |  101 ++++++++++++++++++++++-----------------------
 1 file changed, 50 insertions(+), 51 deletions(-)

$ diffstat string_lib.c.patch
 libutils/string_lib.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

It is even possible to get a separate patch for every hunk of original.diff:

$ ./splitpatch --hunks original.diff
File null.0.patch already exists. Renaming patch.

$ ls -1
evalfunction.c.0.patch
evalfunction.c.1.patch
evalfunction.c.2.patch
evalfunction.c.3.patch
null.0.patch
null.0.patch.0
original.diff
splitpatch
string_lib.c.0.patch
string_lib.c.1.patch
string_lib.c.2.patch

The 4 hunks related to the evalfunction.c file are now available separately (evalfunction.c.0.patch, evalfunction.c.1.patch, …)

What about the null.0.patch files? They look odd, because the original.diff patchfile doesn’t affect any file named null, so where do they come from? They appear when the patch creates a new file, in which case the from file is /dev/null. The header of null.0.patch is self-explanatory:

$ head -2 null.0.patch
--- /dev/null
+++ b/tests/acceptance/01_vars/02_functions/readstringarrayidx.cf
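
The output names above suggest that the patch name is derived from the basename of the from file, which is why new files all end up as null. That is an assumption drawn from the listing, not something I checked in splitpatch’s source. A hypothetical naming helper that falls back to the to file for new files could look like this in Python:

```python
import posixpath


def patch_basename(from_path, to_path):
    """Pick an output name for a split patch from its two header paths."""
    # Header paths may carry a tab-separated timestamp; drop it.
    from_path = from_path.split("\t")[0].strip()
    to_path = to_path.split("\t")[0].strip()
    # A newly created file has /dev/null as its from side, so name the
    # patch after the to side instead of ending up with "null".
    chosen = to_path if from_path == "/dev/null" else from_path
    return posixpath.basename(chosen) + ".patch"
```

With the header shown above, patch_basename("/dev/null", "b/tests/acceptance/01_vars/02_functions/readstringarrayidx.cf") gives readstringarrayidx.cf.patch instead of null.patch.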

Git patch mode

The patch mode of git-add(1) may look like overkill, but it is a powerful tool and does the job. The idea is to selectively stage the hunks one by one, then use git diff to generate a set of patchfiles.

Here is a good introduction to Git patch mode, so I’m not going to paraphrase it here.