CVS diff cleaner

I was recently faced with the following situation: I had a unified diff between two working trees, created by cvs diff, that consisted mostly of spurious hunks caused by $Log ...$ keyword expansions. Since the diff was about 20k lines long, I wasn't going to clean it up by hand, so I wrote the following script.

It requires pylon (by Aaron Bentley); patch-98 is known to work. pylon is available with arch (tla) as aaron.bentley@utoronto.ca--tla-2004/pylon--devel; the archive is located at http://sourcecontrol.net/~abentley/archives/tlasrc/. Since you only need patches.py from it, you can also download just that file. If you do, you may need to change the 'from pylon import patches' line in the script below to 'import patches'. You'll also have to remove 'import util' from patches.py (it is only used by the test cases).

To use the script, specify a regular expression that matches hunks to be ignored as the first argument and a filename as the second argument (if no second argument is given, the script uses stdin).

Attention!

Currently, the input must use Unix (LF) line terminators; CRLF won't work.
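
One possible workaround, sketched here and not part of the original script, is to normalize the line endings before they reach the filter. The helper name normalize_newlines is hypothetical, and the sketch assumes that the CRLF terminators are the only obstacle, which has not been checked against patches.py.

```python
def normalize_newlines(lines):
    """Yield each line with a trailing CRLF replaced by a bare LF.

    Hypothetical helper: works on any iterable of strings, such as an
    open file or sys.stdin. Lines without a CRLF pass through unchanged.
    """
    for line in lines:
        if line.endswith("\r\n"):
            yield line[:-2] + "\n"
        else:
            yield line


# Small demonstration on in-memory lines.
converted = list(normalize_newlines(["--- a.c\r\n", "+++ a.c\r\n", " context\n"]))
```

In the script, this would wrap the input before filtering, for example cvs_diff_filter(normalize_newlines(f)).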

This example strips all hunks that are 11 lines long and have 4 deletions before a single insertion, where the third deleted line contains the text 'original files from':

./filter.py '^(.*\n){3}(-.*\n){2}-.*original files from.*\n(-.*\n)(\+.*\n)(.*\n){3}$' /some/diff/file
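
The pattern works line by line: since . does not match a newline, each .*\n consumes exactly one line, so the expression can only match a hunk text of exactly 11 lines. The sketch below checks this with the standard re module alone. Both hunk texts are made up for illustration, and it assumes that the text the script matches against (str(hunk)) begins with the @@ header line, which depends on patches.py and has not been verified here.

```python
import re

# Same expression as in the command line above.
HUNK_RE = re.compile(
    r'^(.*\n){3}(-.*\n){2}-.*original files from.*\n(-.*\n)(\+.*\n)(.*\n){3}$'
)

# Hypothetical spurious hunk: header, 2 context lines, 4 deletions,
# 1 insertion, 3 context lines (11 lines in total).
spurious_hunk = (
    "@@ -1,9 +1,6 @@\n"
    " /*\n"
    "  * $Log: foo.c,v $\n"
    "- * Revision 1.2  2005/01/12 10:00:00  alice\n"
    "- * Import the\n"
    "- * original files from the vendor\n"
    "- *\n"
    "+ * Revision 1.1  2005/01/10 09:00:00  bob\n"
    "  */\n"
    " \n"
    " int main(void)\n"
)

# Hypothetical genuine change: too short to fit the pattern.
real_hunk = (
    "@@ -10,3 +10,3 @@\n"
    " int x;\n"
    "-x = 1;\n"
    "+x = 2;\n"
    " return x;\n"
)

is_spurious = HUNK_RE.match(spurious_hunk) is not None
is_real_kept = HUNK_RE.match(real_hunk) is None
```

With the script's default of invert being True, a hunk like the first one would be dropped and one like the second would be printed.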

So here it goes -- the script:

#!/usr/bin/env python
from pylon import patches
import re
NEXT_FILE_RE = re.compile(r'^diff ')
CVS_FILTER_RE = re.compile(r'^(\? |Index: |======|RCS file:|retrieving|Binary files .* differ$)')
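def get_lines_until_next_file(iter_lines):
    '''collects lines up to (not including) the next "diff " line.
       Returns (part, more): part iterates over the collected lines
       (None if there were none), more is False once the input is
       exhausted. The lines of the last file are returned as well
       instead of being lost when the input ends.'''
    lines = []
    more = False
    for line in iter_lines:
        if NEXT_FILE_RE.match(line) is not None:
            more = True
            break
        lines.append(line)
    if lines:
        return iter(lines), more
    return None, more
def break_files(iter_lines):
    iter_lines = iter(iter_lines) # so that each call resumes where the last one stopped
    more = True
    while more:
        part, more = get_lines_until_next_file(iter_lines)
        if part is not None:
            yield part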
def filter(iter_lines, match_re = None, invert = True):
    '''takes anything that iterates over lines,
       treating it as if it was in unified diff format.
       This method is a generator in that it "returns" only
       the lines that belong to those hunks that match
       (or, if invert is True, do not match) match_re
       (which must be a valid regular expression object).'''
    for part in break_files(iter_lines):
        patch = patches.parse_patch(part)
        patch_written = False
        for hunk in patch.hunks:
            s = str(hunk)
            if invert:
                output = not match_re.match(s)
            else:
                output = match_re.match(s)
            if output:
                if not patch_written:
                    patch_written = True
                    print "--- %s" % patch.oldname
                    print "+++ %s" % patch.newname
                print s
def cvs_diff_filter(iter_lines):
    for line in iter_lines:
        if CVS_FILTER_RE.match(line) is None:
            yield line
if __name__ == "__main__":
    from sys import argv, stdin
    if len(argv) > 2:
        f = open(argv[2])
    else:
        f = stdin
    filter(cvs_diff_filter(f), re.compile(argv[1]))
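
To see what the CVS_FILTER_RE pre-filter removes before the diff is split into files, here is a minimal sketch that needs only the standard re module. The regular expression and cvs_diff_filter are copied from the script; the sample lines are a made-up excerpt in the usual cvs diff -u layout, with hypothetical file names and revisions.

```python
import re

CVS_FILTER_RE = re.compile(
    r'^(\? |Index: |======|RCS file:|retrieving|Binary files .* differ$)'
)


def cvs_diff_filter(iter_lines):
    # Drop the per-file metadata lines that cvs adds around each diff.
    for line in iter_lines:
        if CVS_FILTER_RE.match(line) is None:
            yield line


# Hypothetical cvs diff output for one untracked and one modified file.
sample = [
    "? untracked.txt\n",
    "Index: foo.c\n",
    "===================================================================\n",
    "RCS file: /cvsroot/proj/foo.c,v\n",
    "retrieving revision 1.2\n",
    "diff -u -r1.2 foo.c\n",
    "--- foo.c\t12 Jan 2005 10:00:00 -0000\t1.2\n",
    "+++ foo.c\t13 Jan 2005 11:00:00 -0000\n",
    "@@ -1,3 +1,3 @@\n",
    " unchanged\n",
    "-old line\n",
    "+new line\n",
    " unchanged\n",
]

kept = list(cvs_diff_filter(sample))
```

Only the "diff " line and everything after it survive. The script then uses each "diff " line as the separator between files and hands the remaining ---, +++ and hunk lines to the patch parser.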