I was recently faced with the following situation: I had a unified diff between two working trees, created by cvs diff, that consisted mostly of spurious hunks caused by $Log ...$ expansions, and I needed to get rid of those hunks. Since the diff was about 20k lines long, I wasn't going to do this by hand, so I wrote the following script.
It requires pylon (by Aaron Bentley); patch-98 is known to work. pylon is available with arch (tla) as aaron.bentley@utoronto.ca--tla-2004/pylon--devel; the archive is located at http://sourcecontrol.net/~abentley/archives/tlasrc/. Since you only need patches.py from it, you can also download just that file. If you do that, you may need to edit the script below so that it uses import patches instead of from pylon import patches. You'll also have to remove import util from patches.py (it is only used in the test cases).
To use the script, specify a regular expression that matches the hunks to be ignored as the first argument and a filename as the second argument (if no second argument is given, the script reads from stdin). The expression is applied with match, so it is anchored at the start of each hunk.
Currently, the input has to have Unix (LF) line terminators; CRLF won't work.
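If your diff does have CRLF line terminators, one possible workaround is to normalize the lines before they reach the script. The following is only a sketch: normalize_newlines is a hypothetical helper that is not part of the script or of pylon, and it assumes the input is an iterable of text lines.

```python
def normalize_newlines(iter_lines):
    """Yield each line with a trailing CRLF replaced by a plain LF.

    Hypothetical helper, not part of the original script.
    """
    for line in iter_lines:
        if line.endswith('\r\n'):
            # drop the carriage return, keep the line feed
            line = line[:-2] + '\n'
        yield line
```

It could be wrapped around the input file in the script's last line, e.g. cvs_diff_filter(normalize_newlines(f)). Keep in mind that the filtered diff then no longer carries the CRs, so it may not apply cleanly to files that really use CRLF.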
The following example strips all hunks that are 11 lines long and have 4 deletions followed by a single insertion, where the third deleted line contains the text 'original files from':
./filter.py '^(.*\n){3}(-.*\n){2}-.*original files from.*\n(-.*\n)(\+.*\n)(.*\n){3}$' /some/diff/file
So here it goes, the script:
#!/usr/bin/env python

from pylon import patches
import re

NEXT_FILE_RE = re.compile(r'^diff ')
CVS_FILTER_RE = re.compile(r'^(\? |Index: |======|RCS file:|retrieving|Binary files .* differ$)')

def break_files(iter_lines):
    '''splits the lines at every "diff " line and yields one
    line iterator per file; the "diff " lines themselves are dropped.'''
    lines = []
    for line in iter_lines:
        if NEXT_FILE_RE.match(line) is None:
            lines.append(line)
        elif lines:
            yield iter(lines)
            lines = []
    if lines: # the last file is not followed by another "diff " line
        yield iter(lines)

def filter(iter_lines, match_re = None, invert = True):
    '''takes anything that iterates over lines,
    treating it as if it was in unified diff format.
    Prints only those hunks that match
    (or, if invert is True, do not match) match_re
    (which must be a compiled regular expression object),
    preceded by the ---/+++ header of the file they belong to.'''
    for part in break_files(iter_lines):
        patch = patches.parse_patch(part)
        patch_written = False
        for hunk in patch.hunks:
            s = str(hunk)
            if invert:
                output = not match_re.match(s)
            else:
                output = match_re.match(s)
            if output:
                # write the file header once, before the first kept hunk
                if not patch_written:
                    patch_written = True
                    print("--- %s" % patch.oldname)
                    print("+++ %s" % patch.newname)
                print(s)

def cvs_diff_filter(iter_lines):
    '''drops the lines that cvs diff adds around the actual diff.'''
    for line in iter_lines:
        if CVS_FILTER_RE.match(line) is None:
            yield line

if __name__ == "__main__":
    from sys import argv, stdin
    if len(argv) > 2:
        f = open(argv[2])
    else:
        f = stdin
    filter(cvs_diff_filter(f), re.compile(argv[1]))
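Before running the script on a large diff, it can help to try the regular expression on a single hunk first. The sketch below does this for the example expression from the usage notes above, using only the re module. Both hunks are made-up samples, and it assumes that str(hunk) gives the @@ header line followed by the hunk's lines, each terminated by LF; check that against your version of patches.py.

```python
import re

# The example expression from the usage notes, unchanged.
HUNK_RE = re.compile(
    r'^(.*\n){3}(-.*\n){2}-.*original files from.*\n(-.*\n)(\+.*\n)(.*\n){3}$')

# Made-up 11-line hunk: header, 2 context lines, 4 deletions,
# 1 insertion, 3 context lines.
LOG_HUNK = (
    "@@ -1,9 +1,6 @@\n"
    " /*\n"
    "  * $Log: foo.c,v $\n"
    "- * Revision 1.2  2004/03/01 12:00:00  alice\n"
    "- * Merged the\n"
    "- * original files from the vendor branch\n"
    "- *\n"
    "+ * Revision 1.1  2004/02/01 12:00:00  bob\n"
    "  * Initial revision\n"
    "  */\n"
    " int main(void);\n"
)

# Made-up hunk with a real change that should be kept.
REAL_HUNK = (
    "@@ -20,3 +17,3 @@\n"
    " int main(void)\n"
    "-{ return 1; }\n"
    "+{ return 0; }\n"
    " /* end */\n"
)

def is_spurious(hunk_text):
    """True if the hunk would be stripped (the script uses match, too)."""
    return HUNK_RE.match(hunk_text) is not None

if __name__ == "__main__":
    for name, hunk in (("LOG_HUNK", LOG_HUNK), ("REAL_HUNK", REAL_HUNK)):
        print("%s spurious: %s" % (name, is_spurious(hunk)))
```

Since . does not match a newline, each (.*\n) group consumes exactly one line, which is what ties the expression to hunks of exactly 11 lines.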


