Home Page

HW #1 Due October 19, 2007.

For this homework I want you to either

Implement the apriori algorithm for mining association rules (for CS Majors, 5% of the overall grade). Your program should run on large data sets efficiently. I will group your programs in to three categories, fast, average, and slow. Given that your programs run correctly, those in the category fast will have 5pts, average will have 4pts, and slow will have 3 pts out of 5. You also need to write a short report describing what data structures or techniques you used to increase the speed and efficiency.

The input to your program will be a text file where each line contains a customer id, transaction id, and the item id. You don't have to do anything regarding the customer ID for the time being. Consecutive lines with the same transaction id form a transaction. Your program will also take the minimum support and minimum confidence thresholds as parameter as well as the number of transactions and the number of items. Items will be represented by numbers in the range 1... maxitems. The output of your program will be the large itemsets and the association rules obtained from those large itemsets with support and confidence grater than or equal to the given threshold values.

Refer to the following paper for the detailed description of the apriori algorithm.

R. Agrawal and R. Srikant. "Fast Algorithms for Mining Association Rules". Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, Sept. 1994. Abstract. [PS] [PDF]

Click Here for more information on the datasets