Data Mining Mini Project #1: Know Your Data, Naive Bayesian Classifier, and k-fold cross validation

1. Due by 11:55PM, Tuesday, March 12, 2019.

2. Grading guidelines: Posted in the top panel on iLearn.

3. Dataset: The dataset you will use for this project is the Census Income or Adult dataset, which is available at (copy and paste this URL into your browser if clicking this link does not lead you to the webpage). For a brief description of this dataset, click the Data Set Description link. To download the data, click the Data Folder link, then click the file to download the data. This dataset contains both categorical and continuous attributes. In addition, this dataset also contains missing attribute values.

4. Problems
   (i) Use basic visualization techniques to gain an initial understanding of the dataset. Specifically, you are required to visualize the relationship between each attribute and the class label. For a continuous attribute, you might need to discretize it first using a simple strategy such as equi-width binning. Please experiment with at least three different bin widths if you decide to discretize a continuous attribute. Observe these basic visualizations and summarize your main insights. You are strongly recommended to use Tableau for this task.
   (ii) Handling missing values: suggest and implement at least two strategies to handle the missing values for categorical and numeric attributes, respectively. These strategies should be based on your observations made in the previous step.
   (iii) Implement a Naïve Bayesian Classifier for this dataset. This dataset contains continuous attributes. Take the following two different approaches to handle a continuous attribute: (1) use the equal-width binning method to transform this attribute into a categorical attribute before building the classifier, selecting a "proper" width based on your observations made in step (i); (2) assume this attribute follows a Gaussian distribution.
   (iv) Implement the k-fold cross validation strategy and evaluate your classifier by setting k=10. Also evaluate the impact of the different strategies implemented in step (ii) for missing values and of the two approaches for handling continuous attributes in step (iii).

5. Requirements
   (i) Individual work only.
   (ii) You need to implement this classifier in one of the following programming languages: C, C++, Java, or Python.
   (iii) To validate your implementation, you can compare yours with an existing implementation, for instance, the Naive Bayesian Classifier included in the Weka data mining suite.
   (iv) Use 10-fold cross validation to evaluate your algorithm. Adopt the classification accuracy, i.e., #(records correctly classified in the test set)/#(total records in the test set), precision, recall and F1-measure to measure the quality of your classifier. Implement this evaluation module either as a separate program or as a subroutine in the program implemented in (ii).

6. Submission instructions
   (i) Archive the following items into one compressed file with your name in the file name:
       a. Source code with comments in the language of your choice.
       b. Instructions on compiling and running your program.
       c. A brief description of the main steps that you have adopted in accomplishing this project. For instance, have you done any data preprocessing tasks? If yes, what are they and why are they necessary?
       d. Evaluation strategies, results and a brief discussion. For instance, are the results acceptable? What can you do to improve the results?
   (ii) Submit the above file on iLearn. No late submission or e-mail submission will be accepted.

7. Project demonstration
   You will be required to demonstrate your work in class or during the instructor's office hours. You will be asked to explain the major functions in your program(s) as part of the demonstration.

. . .
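
The building blocks named in problems 4(i), 4(iii) and 4(iv) and in requirement 5(iv) can be sketched in Python, one of the permitted languages. This is a minimal illustration, not a reference solution: every function name is hypothetical, the shuffling seed is arbitrary, and treating one class label as the "positive" class for precision and recall is an assumption that the assignment does not spell out.

```python
import math
import random


def equal_width_bin(values, n_bins):
    """Map each numeric value to a bin index in [0, n_bins - 1].

    All bins share the same width, (max - min) / n_bins.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:  # constant attribute: everything falls in bin 0
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def gaussian_log_pdf(x, mean, std):
    """Log-density of a Gaussian N(mean, std^2) evaluated at x.

    In the Gaussian approach, mean and std would be estimated per class
    from the training fold. std must be positive.
    """
    var = std * std
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def k_fold_indices(n_records, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Records are shuffled once, then dealt into k nearly equal folds;
    each fold serves as the test set exactly once.
    """
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train, test


def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one test fold.

    `positive` names the label counted as the positive class (an assumption).
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}
```

In a full evaluation loop, the classifier would be trained on each `train` index list, applied to the matching `test` list, and the per-fold metrics averaged over the 10 folds. Bin boundaries and Gaussian parameters are best estimated from the training fold only, so that the test fold does not leak into the model.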
