I was recently faced with the following situation: I had a unified diff between two working trees, created by cvs diff, that consisted mostly of spurious hunks caused by $Log ...$ keyword expansions. Since the diff was about 20k lines long, I wasn't going to weed out those hunks by hand, so I wrote the following script.

It requires pylon (by Aaron Bentley); patch-98 is known to work. pylon is available with arch (tla) as aaron.bentley@utoronto.ca--tla-2004/pylon--devel; the archive is located at http://sourcecontrol.net/~abentley/archives/tlasrc/. Since you only need patches.py from it, you can also download just that file. If you do, you may need to change the line "from pylon import patches" in the script below to "import patches". You'll also have to remove "import util" from patches.py (it is only used by the test cases there).

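To use the script, pass a regular expression that matches the hunks to be ignored as the first argument and a filename as the second argument (if no second argument is given, the script reads from stdin). The expression is applied with re.match to the text of each hunk, so it has to describe the hunk from its first line on.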

Note: currently, the input must have Unix (LF) line terminators; CRLF won't work.
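One possible workaround is to convert the line endings before the lines reach the script. The following is only a sketch: normalize_line_endings is a hypothetical helper that is not part of the script, and the sample lines are made up. Keep in mind that the filtered patch then carries LF endings as well, so it may not apply cleanly to files that really use CRLF.

```python
def normalize_line_endings(iter_lines):
    """Yield each line with a trailing CRLF replaced by a plain LF.

    Hypothetical helper, not part of the original script.
    """
    for line in iter_lines:
        if line.endswith("\r\n"):
            yield line[:-2] + "\n"
        else:
            yield line


# Made-up sample input with mixed line endings.
sample = ["--- a.txt\r\n", "+++ a.txt\r\n", "@@ -1 +1 @@\r\n", "-old\r\n", "+new\n"]
cleaned = list(normalize_line_endings(sample))
```

Such a generator could be wrapped around the input file object before it is handed to the script's own filters.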

This example strips all hunks that are 11 lines long and have 4 deletions before a single insertion, where the third deleted line contains the text 'original files from':

./filter.py '^(.*\n){3}(-.*\n){2}-.*original files from.*\n(-.*\n)(\+.*\n)(.*\n){3}$' /some/diff/file
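To see what this expression accepts, it can be tried on hunk text directly, without pylon. The sketch below assumes that the text of a hunk starts with its @@ header line and ends with a newline; both sample hunks (file contents, names, dates) are invented for illustration.

```python
import re

# The expression from the command line above.
HUNK_RE = re.compile(
    r'^(.*\n){3}(-.*\n){2}-.*original files from.*\n(-.*\n)(\+.*\n)(.*\n){3}$'
)

# Invented 11-line hunk: header, 2 context lines, 4 deletions,
# 1 insertion, 3 context lines.
spurious_hunk = (
    "@@ -1,9 +1,6 @@\n"
    " /*\n"
    "  * $Log: foo.c,v $\n"
    "- * Revision 1.2  2004/01/02 10:00:00  alice\n"
    "- * fixed typo\n"
    "- * imported original files from vendor\n"
    "- *\n"
    "+ * Revision 1.1  2004/01/01 09:00:00  bob\n"
    "  * initial\n"
    "  */\n"
    " int x;\n"
)

# Invented hunk with a real change; it has a different shape.
real_hunk = (
    "@@ -10,3 +10,3 @@\n"
    " int y;\n"
    "-int z = 1;\n"
    "+int z = 2;\n"
    " return z;\n"
)

is_spurious = HUNK_RE.match(spurious_hunk) is not None
is_real_dropped = HUNK_RE.match(real_hunk) is not None
```

Because "." does not match a newline, every "(.*\n)" group consumes exactly one line, which is what makes the line counts in the expression reliable.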

So here it goes – the script:

#!/usr/bin/env python

import re
import sys

from pylon import patches

# A line starting with "diff " begins the section of the next file.
NEXT_FILE_RE = re.compile(r'^diff ')
# Lines that cvs diff emits around the actual unified diff.
CVS_FILTER_RE = re.compile(r'^(\? |Index: |======|RCS file:|retrieving|Binary files .* differ$)')

def break_files(iter_lines):
    '''Splits the input into one iterator of lines per file,
       cutting at each "diff " line (which is dropped).'''
    lines = []
    for line in iter_lines:
        if NEXT_FILE_RE.match(line) is None:
            lines.append(line)
        elif lines:
            yield iter(lines)
            lines = []
    # The last file is not followed by a "diff " line; emit it here,
    # otherwise it would be silently lost.
    if lines:
        yield iter(lines)

def filter_hunks(iter_lines, match_re, invert=True):
    '''takes anything that iterates over lines,
       treating it as if it was in unified diff format,
       and prints only those hunks that match
       (or, if invert is True, do not match) match_re
       (which must be a valid regular expression object).
       File headers are printed only for files that keep
       at least one hunk.'''
    for part in break_files(iter_lines):
        patch = patches.parse_patch(part)
        patch_written = False
        for hunk in patch.hunks:
            s = str(hunk)
            matched = match_re.match(s) is not None
            if matched != invert:
                if not patch_written:
                    patch_written = True
                    print("--- %s" % patch.oldname)
                    print("+++ %s" % patch.newname)
                print(s)

def cvs_diff_filter(iter_lines):
    '''Drops the lines that cvs diff adds around the unified diff.'''
    for line in iter_lines:
        if CVS_FILTER_RE.match(line) is None:
            yield line

if __name__ == "__main__":
    # argv[1]: regex for hunks to drop; argv[2]: optional diff file.
    if len(sys.argv) > 2:
        f = open(sys.argv[2])
    else:
        f = sys.stdin
    filter_hunks(cvs_diff_filter(f), re.compile(sys.argv[1]))
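The two pre-processing steps (dropping the cvs-specific lines, then cutting the stream into per-file parts at each "diff " line) can be tried out without pylon. The following is a stand-alone sketch rather than the script itself: split_files is a simplified stand-in for the script's file splitting, and the sample cvs diff output is invented.

```python
import re

CVS_FILTER_RE = re.compile(
    r'^(\? |Index: |======|RCS file:|retrieving|Binary files .* differ$)'
)


def cvs_diff_filter(iter_lines):
    """Drop the lines that cvs diff adds around the unified diff."""
    for line in iter_lines:
        if CVS_FILTER_RE.match(line) is None:
            yield line


def split_files(iter_lines):
    """Yield one list of lines per file, cutting at each 'diff ' line."""
    lines = []
    for line in iter_lines:
        if line.startswith("diff "):
            if lines:
                yield lines
            lines = []
        else:
            lines.append(line)
    if lines:  # the last file has no following 'diff ' line
        yield lines


# Invented cvs diff output covering two files.
sample = [
    "? untracked.txt\n",
    "Index: foo.c\n",
    "=" * 67 + "\n",
    "RCS file: /cvsroot/demo/foo.c,v\n",
    "retrieving revision 1.1\n",
    "retrieving revision 1.2\n",
    "diff -u -r1.1 -r1.2 foo.c\n",
    "--- foo.c\n", "+++ foo.c\n", "@@ -1 +1 @@\n", "-old\n", "+new\n",
    "Index: bar.c\n",
    "=" * 67 + "\n",
    "RCS file: /cvsroot/demo/bar.c,v\n",
    "retrieving revision 1.3\n",
    "retrieving revision 1.4\n",
    "diff -u -r1.3 -r1.4 bar.c\n",
    "--- bar.c\n", "+++ bar.c\n", "@@ -1 +1 @@\n", "-a\n", "+b\n",
]

parts = list(split_files(cvs_diff_filter(sample)))
```

Each element of parts then starts with the "---" line of one file, which is the shape of input a per-file patch parser would expect.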