I was recently faced with the following situation: I had a unified diff between two working trees, created by cvs diff, that consisted mostly of spurious hunks caused by $Log ...$ keyword expansions, and I needed to strip those hunks out. Since the diff was about 20k lines long, I wasn't going to do this by hand, so I wrote the following script.
It requires pylon (by Aaron Bentley); patch-98 is known to work. pylon is available with arch (tla) as aaron.bentley@utoronto.ca--tla-2004/pylon--devel; the archive is located at http://sourcecontrol.net/~abentley/archives/tlasrc/. Since you only need patches.py from it, you can also download just that file. If you do, you may need to change the line "from pylon import patches" in the script below to "import patches". You'll also have to remove "import util" from patches.py (it is only used by the test cases).
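To use the script, pass a regular expression that matches the hunks to be dropped as the first argument and a filename as the second (if no second argument is given, the script reads stdin). The expression is applied with re.match to the text of each hunk as a whole, so it is anchored at the start of the hunk and has to account for every line up to the part you care about.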
Attention!
Currently, the input must use Unix (LF) line terminators; CRLF won't work.
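If your diff does have CRLF line terminators, one way around this is to normalize the lines before the script sees them. The following is only a minimal sketch and not part of the original script; the helper name normalize_newlines is made up for illustration.

```python
def normalize_newlines(iter_lines):
    """Yield each line with a trailing CRLF replaced by a plain LF.

    Hypothetical helper, not part of the original script. Lines that
    already end in LF (or have no terminator) pass through unchanged.
    """
    for line in iter_lines:
        if line.endswith("\r\n"):
            yield line[:-2] + "\n"
        else:
            yield line
```

In the script below you would wrap the input file object in it before handing it to cvs_diff_filter. Keep in mind that the resulting patch then contains LF-only lines, so it may not apply cleanly to files that really are stored with CRLF.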
This example strips all hunks that are 11 lines long and have 4 deletions before a single insertion, where the third deleted line contains the text 'original files from':
./filter.py '^(.*\n){3}(-.*\n){2}-.*original files from.*\n(-.*\n)(\+.*\n)(.*\n){3}$' /some/diff/file
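Before running an expression like this over a large diff, it can help to try it on a single hunk first. The sketch below does that with the expression from the example; the sample hunk is invented for illustration, and its exact layout (in particular, that the text starts with the @@ line) is an assumption about how patches.py renders a hunk.

```python
import re

# The expression from the example above, unchanged.
HUNK_RE = re.compile(
    r'^(.*\n){3}(-.*\n){2}-.*original files from.*\n(-.*\n)(\+.*\n)(.*\n){3}$'
)

# Invented 11-line hunk: three leading lines, four deletions (the third
# one contains the marker text), one insertion, three trailing lines.
SAMPLE_HUNK = "".join([
    "@@ -1,9 +1,6 @@\n",
    " line one\n",
    " line two\n",
    "-removed 1\n",
    "-removed 2\n",
    "-imported the original files from upstream\n",
    "-removed 4\n",
    "+added 1\n",
    " line three\n",
    " line four\n",
    " line five\n",
])

# Same shape, but without the marker text in the third deleted line.
OTHER_HUNK = SAMPLE_HUNK.replace("original files from", "something else")

if __name__ == "__main__":
    print(HUNK_RE.match(SAMPLE_HUNK) is not None)
    print(HUNK_RE.match(OTHER_HUNK) is not None)
```

Since "." does not match a newline, every (.*\n) group consumes exactly one line, which is what makes the expression sensitive to the hunk's length and shape.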
So here it goes -- the script:
#!/usr/bin/env python
from pylon import patches
import re
import sys

NEXT_FILE_RE = re.compile(r'^diff ')
CVS_FILTER_RE = re.compile(r'^(\? |Index: |======|RCS file:|retrieving|Binary files .* differ$)')

def break_files(iter_lines):
    '''splits the input at "diff " lines and yields one line iterator per
    file, skipping empty parts. The last file is yielded when the input
    ends.'''
    lines = []
    for line in iter_lines:
        if NEXT_FILE_RE.match(line) is None:
            lines.append(line)
        else:
            if lines:
                yield iter(lines)
            lines = []
    if lines:
        yield iter(lines)

def filter_hunks(iter_lines, match_re, invert = True):
    '''takes anything that iterates over lines, treating it as if it was in
    unified diff format, and prints only those hunks that match (or, if
    invert is True, do not match) match_re (which must be a compiled regular
    expression object). The ---/+++ header of a file is printed before the
    first hunk that is printed for it.'''
    for part in break_files(iter_lines):
        patch = patches.parse_patch(part)
        patch_written = False
        for hunk in patch.hunks:
            s = str(hunk)
            matched = match_re.match(s) is not None
            if matched != invert:
                if not patch_written:
                    patch_written = True
                    sys.stdout.write("--- %s\n" % patch.oldname)
                    sys.stdout.write("+++ %s\n" % patch.newname)
                # s is expected to end with a newline already (the example
                # expression above relies on that), so none is added here
                sys.stdout.write(s)

def cvs_diff_filter(iter_lines):
    '''drops the lines that cvs diff adds around the actual diff.'''
    for line in iter_lines:
        if CVS_FILTER_RE.match(line) is None:
            yield line

if __name__ == "__main__":
    if len(sys.argv) > 2:
        f = open(sys.argv[2])
    else:
        f = sys.stdin
    filter_hunks(cvs_diff_filter(f), re.compile(sys.argv[1]))
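If you cannot get hold of patches.py, the core idea can be approximated with the standard library alone. The following is a rough sketch under simplifying assumptions, not a replacement for the script: it handles the lines of a single file section, assumes a well-formed unified diff, and the names split_hunks and drop_matching_hunks are invented for this example.

```python
import re

HUNK_HEADER_RE = re.compile(r'^@@ ')
FILE_HEADER_RE = re.compile(r'^(--- |\+\+\+ )')


def split_hunks(lines):
    """Split one file section into (headers, hunks).

    headers is the list of '---'/'+++' lines seen before the first hunk;
    hunks is a list of strings, each starting with its '@@' line.
    """
    headers, hunks = [], []
    for line in lines:
        if HUNK_HEADER_RE.match(line):
            hunks.append(line)        # start a new hunk
        elif hunks:
            hunks[-1] += line         # body line of the current hunk
        elif FILE_HEADER_RE.match(line):
            headers.append(line)      # file header before any hunk
    return headers, hunks


def drop_matching_hunks(lines, match_re):
    """Return the section as text, without the hunks that match match_re.

    Returns an empty string if no hunk is left, so that no bare file
    header is emitted.
    """
    headers, hunks = split_hunks(lines)
    kept = [h for h in hunks if match_re.match(h) is None]
    if not kept:
        return ""
    return "".join(headers + kept)


# Invented sample: the first hunk only touches a $Log$ line.
SAMPLE_SECTION = [
    "--- a/foo.c\n",
    "+++ b/foo.c\n",
    "@@ -1,2 +1,2 @@\n",
    " keep\n",
    "-$Log: old $\n",
    "+$Log: new $\n",
    "@@ -10,2 +10,2 @@\n",
    " ctx\n",
    "-real old\n",
    "+real new\n",
]

if __name__ == "__main__":
    log_re = re.compile(r'(.*\n)*-\$Log')
    print(drop_matching_hunks(SAMPLE_SECTION, log_re))
```

Unlike patches.py, this does no validation of the hunk headers or line counts, so it is best treated as a quick experiment rather than something to run on a diff you care about.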