Extracting lecture slides from video

Sometimes I get lecture videos online, but the slides are not offered as a PDF or in any other format, or they are only available for purchase for a few euros.

Being a computer enthusiast, I looked for a quick, easy way to extract the slides from the videos without too much hassle. My first try was to rely on the video encoder placing keyframes sensibly and to simply extract those, but I ended up with 1000 images for 15 slides. D'oh.

Next try: post-processing the extracted images and removing the duplicates. I came up with a very simple Python script that runs ffmpeg to extract the keyframes and then does a simple histogram comparison of consecutive images. That got me down to about 30 images for 15 slides. Not bad: most of the remaining duplicates are recorded mouse movements or slide transitions. I can live with that.
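
To make the histogram comparison concrete, here is a minimal sketch of the root-mean-square measure on plain lists. It is written for Python 3, and the toy four-bin histograms are made up for illustration; in practice they would come from PIL's Image.histogram().

```python
import math


def rms_difference(h1, h2):
    """Root-mean-square difference between two equally long histograms."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    squared = sum((a - b) ** 2 for a, b in zip(h1, h2))
    return math.sqrt(squared / len(h1))


# Toy histograms (assumed values, not taken from a real video)
slide = [10, 0, 5, 1]
same_slide = [10, 0, 5, 1]
next_slide = [2, 8, 5, 1]

print(rms_difference(slide, same_slide))  # 0.0 for identical histograms
print(rms_difference(slide, next_slide))  # greater than 0 for different ones
```

A result of exactly 0 means the two histograms are identical; anything larger means the pixel value distribution changed between the two images.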

Here is the script (the histogram part is stolen from http://stackoverflow.com/questions/1927660/compare-two-images-the-python-linux-way)

#!/usr/bin/env python2

import math, operator
from PIL import Image
import sys
import os
import glob
import subprocess
import shutil

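def compare(file1, file2):
    # Return the RMS difference between the histograms of two image files.
    # A result of 0 means the histograms are identical.
    image1 = Image.open(file1)
    image2 = Image.open(file2)
    h1 = image1.histogram()
    h2 = image2.histogram()
    # Sum the squared per-bin differences, average them, take the root
    rms = math.sqrt(reduce(operator.add,
                           map(lambda a,b: (a-b)**2, h1, h2))/len(h1))
    return rms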

if __name__=='__main__':
    if len(sys.argv) < 3:
        sys.exit("Need video file and output dir as parameter")
    if not os.path.exists("decomp"):
        os.mkdir("decomp")
    else:
        sys.exit("decomp already exists, exit")
    if not os.path.exists(sys.argv[2]):
        os.mkdir(sys.argv[2])

    cmd = ["ffmpeg", "-i", sys.argv[1], "-vf", "select='eq(pict_type,I)'", "-vsync", "0", "-f", "image2", "decomp/%09d.png"]

    print "Running ffmpeg: " + " ".join(cmd)

    subprocess.call(cmd)

    print "Done, now eliminating duplicate images and moving unique ones to output folder..."

    filelist = glob.glob(os.path.join("decomp", '*.png'))
    filelist.sort()
    for ii in range(0, len(filelist)):
        # Determine the target file name up front so it is always defined
        # and always belongs to the current image
        tail = os.path.basename(filelist[ii])
        if ii < len(filelist)-1:
            if compare(filelist[ii], filelist[ii+1]) == 0:
                print 'Found similar images: ' + filelist[ii] + " and " + filelist[ii+1]
            else:
                print 'Found unique image: ' + filelist[ii]
                shutil.copyfile(filelist[ii], os.path.join(sys.argv[2], tail))
        else:
            # The last image has no successor to compare against, so keep it
            shutil.copyfile(filelist[ii], os.path.join(sys.argv[2], tail))
    shutil.rmtree("decomp")
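
The exact comparison (== 0) only drops frames whose histograms match perfectly, which is probably why the mouse movements survive. One possible refinement, sketched here for Python 3 on toy histograms, is to treat two frames as duplicates when their RMS difference stays at or below a threshold. The function name unique_indices and the threshold value are hypothetical and would need tuning on real videos.

```python
import math


def rms_difference(h1, h2):
    """Root-mean-square difference between two equally long histograms."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)) / len(h1))


def unique_indices(histograms, threshold=0.0):
    """Indices of frames that differ from their successor by more than
    threshold. The last frame has no successor and is always kept."""
    keep = []
    for i in range(len(histograms)):
        is_last = i == len(histograms) - 1
        if is_last or rms_difference(histograms[i], histograms[i + 1]) > threshold:
            keep.append(i)
    return keep


# Toy histograms (assumed values) standing in for Image.histogram() output
frames = [
    [10, 0, 5, 1],  # slide 1
    [10, 0, 5, 1],  # slide 1 again, identical
    [9, 1, 5, 1],   # slide 1 with a small change, e.g. a mouse pointer
    [2, 8, 5, 1],   # slide 2
]

print(unique_indices(frames))                 # [1, 2, 3]
print(unique_indices(frames, threshold=1.0))  # [2, 3]
```

With the default threshold of 0 this behaves like the script above. A larger threshold also folds the slightly changed frame into its successor, at the risk of dropping a real slide if two slides happen to look very similar.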