《数据仓库与数据挖掘》 (Data Warehousing and Data Mining), Chapter 9
Lecture slides based on "Data Mining: Concepts and Techniques", Chapter 7: Classification and Prediction


Chapter 7: Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by neural networks
- Classification by support vector machines (SVM)
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
- Summary

Classification vs. Prediction

Classification:
- predicts categorical class labels (discrete or nominal)
- classifies data: constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data

Prediction:
- models continuous-valued functions, i.e., predicts unknown or missing values

Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis.

Classification: A Two-Step Process

Model construction: describing a set of predetermined classes
- Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
- The set of tuples used for model construction is the training set.
- The model is represented as classification rules, decision trees, or mathematical formulae.

Model usage: classifying future or unknown objects (a code sketch follows this list)
- Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test-set samples that are correctly classified by the model.
- The test set must be independent of the training set, otherwise over-fitting will occur.
- If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
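As a concrete illustration of the two-step process, here is a minimal sketch assuming scikit-learn is available; the toy feature matrix and labels are invented for illustration only.

```python
# Minimal sketch of the two-step process, assuming scikit-learn is installed.
# The toy feature matrix X and labels y are invented for illustration.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[0, 1], [1, 1], [1, 0], [0, 0], [1, 2], [2, 2], [2, 0], [0, 2]]
y = ['yes', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no']

# Keep the test set independent of the training set to avoid an overly
# optimistic (over-fitted) accuracy estimate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)    # Step 1: model construction
accuracy = accuracy_score(y_test, model.predict(X_test))  # Step 2: accuracy estimation
print(f"test-set accuracy: {accuracy:.2f}")

# If the accuracy is acceptable, classify tuples whose labels are unknown:
print(model.predict([[2, 1]]))
```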

Classification Process (1): Model Construction
- Training data feeds a classification algorithm, which outputs a classifier (the model).
- Example rule learned: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction
- The classifier is applied first to testing data, then to unseen data.
- Example: for the unseen tuple (Jeff, Professor, 4), the model answers the query "Tenured?"

Supervised vs. Unsupervised Learning

Supervised learning (classification):
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
- New data is classified based on the training set.

Unsupervised learning (clustering):
- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning: preprocess data in order to reduce noise and handle missing values.
- Relevance analysis (feature selection): remove irrelevant or redundant attributes.
- Data transformation: generalize and/or normalize data.

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Predictive accuracy
- Speed and scalability: time to construct the model; time to use the model
- Robustness: handling noise and missing values
- Scalability: efficiency on disk-resident databases
- Interpretability: understanding and insight provided by the model
- Goodness of rules: decision tree size; compactness of classification rules

Training Dataset
This follows an example from Quinlan's ID3. The 14 training tuples (class label: buys_computer):

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

Output: A Decision Tree for "buys_computer"
- Root test: age?
- age <=30: test student? (no -> no; yes -> yes)
- age 31..40: yes
- age >40: test credit_rating? (excellent -> no; fair -> yes)

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm; a code sketch follows this list):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
- There are no samples left.
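A minimal pure-Python sketch of this greedy induction. The record format (a list of dicts of categorical attribute values plus a 'class' key) and the helper names are inventions of this sketch; continuous attributes, tie-breaking, and pruning are omitted.

```python
# Minimal ID3-style sketch: greedy, top-down, divide-and-conquer.
# `records` is a hypothetical list of dicts with categorical attributes
# plus a 'class' key; continuous attributes and pruning are not handled.
from collections import Counter
from math import log2

def entropy(records):
    counts = Counter(r['class'] for r in records)
    total = len(records)
    return -sum(c / total * log2(c / total) for c in counts.values())

def build_tree(records, attributes):
    classes = {r['class'] for r in records}
    if len(classes) == 1:                      # all samples in one class
        return classes.pop()
    if not attributes:                         # no attributes left: majority vote
        return Counter(r['class'] for r in records).most_common(1)[0][0]

    def expected_info(attr):                   # E(A): weighted entropy after split
        values = {r[attr] for r in records}
        return sum(
            (len(sub) / len(records)) * entropy(sub)
            for v in values
            for sub in [[r for r in records if r[attr] == v]]
        )

    best = min(attributes, key=expected_info)  # max gain = min expected info
    node = {'attribute': best, 'branches': {}}
    for v in {r[best] for r in records}:
        subset = [r for r in records if r[best] == v]
        node['branches'][v] = build_tree(
            subset, [a for a in attributes if a != best])
    return node
```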

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Let S contain s_i tuples of class C_i, for i = 1, ..., m. The information (entropy) required to classify any arbitrary tuple is

  I(s_1, ..., s_m) = -sum_{i=1}^{m} (s_i/s) log2(s_i/s)

- The entropy of attribute A with values {a_1, a_2, ..., a_v} is

  E(A) = sum_{j=1}^{v} ((s_{1j} + ... + s_{mj}) / s) * I(s_{1j}, ..., s_{mj})

- The information gained by branching on attribute A is

  Gain(A) = I(s_1, ..., s_m) - E(A)

Attribute Selection by Information Gain Computation (verified in code below)
- Class P: buys_computer = "yes"; class N: buys_computer = "no".
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age: "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's, so its branch contributes (5/14) I(2, 3). Hence E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694, and Gain(age) = 0.940 - 0.694 = 0.246.
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is chosen as the root test.
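The slide's numbers can be checked directly in a few lines of pure Python; the class counts come from the 14-tuple table above.

```python
# Verify the information-gain numbers on this slide (pure Python).
from math import log2

def info(*counts):                      # I(s1, ..., sm)
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

i_all = info(9, 5)                      # I(9, 5)
# age partitions: <=30 -> (2 yes, 3 no), 31..40 -> (4 yes, 0 no), >40 -> (3 yes, 2 no)
e_age = 5/14 * info(2, 3) + 4/14 * info(4, 0) + 5/14 * info(3, 2)
print(round(i_all, 3), round(e_age, 3), round(i_all - e_age, 3))
# -> 0.94 0.694 0.247   (the slide's 0.246 subtracts the rounded 0.940 - 0.694)
```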

Other Attribute Selection Measures
- Gini index (CART, IBM IntelligentMiner)
- All attributes are assumed continuous-valued.
- Assume there exist several possible split values for each attribute.
- May need other tools, such as clustering, to get the possible split values.
- Can be modified for categorical attributes.

Gini Index (IBM IntelligentMiner; both formulas are sketched in code below)
- If a data set T contains examples from n classes, the gini index gini(T) is defined as

  gini(T) = 1 - sum_{j=1}^{n} p_j^2

  where p_j is the relative frequency of class j in T.
- If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

  gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

- The attribute that provides the smallest gini_split(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).
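A minimal sketch of both formulas in pure Python; the class-count tuples in the example calls are invented for illustration, except the 9-yes/5-no pair from the buys_computer data.

```python
# Gini index and gini_split, directly from the definitions above.
def gini(counts):                       # counts: tuples per class in T
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return n1 / n * gini(left_counts) + n2 / n * gini(right_counts)

print(round(gini((9, 5)), 3))                  # 0.459 for the 9-yes/5-no set
print(round(gini_split((2, 3), (7, 2)), 3))    # one hypothetical binary split
```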

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules (a traversal sketch follows the example).
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.

Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31..40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
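A minimal sketch of the path-to-rule traversal; the nested-dict encoding of the buys_computer tree (its shape and key names) is an assumption of this sketch, not a fixed format.

```python
# Turn each root-to-leaf path into an IF-THEN rule.
# The nested-dict encoding of the buys_computer tree is an assumption
# of this sketch, not a fixed format.
tree = {'attribute': 'age', 'branches': {
    '<=30': {'attribute': 'student', 'branches': {'no': 'no', 'yes': 'yes'}},
    '31..40': 'yes',
    '>40': {'attribute': 'credit_rating',
            'branches': {'excellent': 'no', 'fair': 'yes'}},
}}

def extract_rules(node, conditions=()):
    if not isinstance(node, dict):             # leaf: holds the class prediction
        antecedent = ' AND '.join(conditions)
        print(f"IF {antecedent} THEN buys_computer = {node}")
        return
    for value, child in node['branches'].items():
        extract_rules(child, conditions + (f"{node['attribute']} = {value}",))

extract_rules(tree)
```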

Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data. Too many branches, some of which may reflect anomalies due to noise or outliers; poor accuracy for unseen samples.
- Two approaches to avoid overfitting:
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
  - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the "best pruned tree".

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets.
- Use cross-validation, e.g., 10-fold cross-validation.
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node improves the entire distribution.
- Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized.

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication.

Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine-learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
- Why decision tree induction in data mining?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - comparable classification accuracy with other methods

Scalable Decision Tree Induction Methods in Data Mining Studies
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure.
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning: stop growing the tree earlier.
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label).

Data Cube-Based Decision-Tree Induction
- Integration of generalization with decision-tree induction (Kamber et al. '97).
- Classification at primitive concept levels (e.g., precise temperature, humidity, outlook) leads to low-level concepts, scattered classes, bushy classification trees, and semantic interpretation problems.
- Cube-based multi-level classification: relevance analysis at multiple levels; information-gain analysis with dimension + level.

Presentation of Classification Results
[Screenshot slide: display of classification results.]

Visualization of a Decision Tree in SGI/MineSet 3.0
[Screenshot slide: decision-tree visualization.]

Interactive Visual Mining by Perception-Based Classification (PBC)
[Screenshot slide.]

Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
- Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown.
- Let H be the hypothesis that X belongs to class C.
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
- P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects the background knowledge).
- P(X): probability that the sample data is observed.
- P(X|H): probability of observing the sample X, given that the hypothesis holds.

Bayesian Theorem
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

  P(H|X) = P(X|H) P(H) / P(X)

- Informally, this can be written as: posterior = likelihood x prior / evidence.
- MAP (maximum a posteriori) hypothesis: h_MAP = argmax_{h in H} P(X|h) P(h).
- Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost.
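Bayes' theorem is one line of code. The numbers below are placeholders chosen so that they match the naive-Bayes example later in this chapter (P(X|yes) = 0.044, P(yes) = 9/14, and so on); the evidence P(X) is obtained by summing over both hypotheses.

```python
# Bayes' theorem as code: posterior = likelihood * prior / evidence.
def posterior(likelihood, prior, evidence):
    return likelihood * prior / evidence

# Evidence P(X), summed over both hypotheses (values from the example below):
p_x = 0.044 * 9/14 + 0.019 * 5/14
print(round(posterior(0.044, 9/14, p_x), 2))   # P(H|X), about 0.81
```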

Naïve Bayes Classifier
- A simplified assumption: attributes are conditionally independent given the class:

  P(X|C_i) = prod_{k=1}^{n} P(x_k|C_i)

- The probability of observing, say, two elements y_1 and y_2 together, given the current class C, is the product of the probabilities of each element taken separately, given the same class: P([y_1, y_2] | C) = P(y_1|C) * P(y_2|C).
- No dependence relation between attributes is modeled.
- This greatly reduces the computation cost: only count the class distribution.
- Once P(X|C_i) is known, assign X to the class with maximum P(X|C_i) * P(C_i).

Training dataset for the example:
- Class C1: buys_computer = 'yes'; class C2: buys_computer = 'no'.
- Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair).

Naïve Bayesian Classifier: Example
Compute P(X|Ci) for each class:
P(age <= 30 | buys_computer = "yes") = 2/9 = 0.222
P(age <= 30 | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
P(X|Ci):
P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) * P(Ci):
P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
Therefore X belongs to class "buys_computer = yes".
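The arithmetic on this slide can be reproduced in a few lines of pure Python, with the conditional probabilities copied from the counts above.

```python
# Reproduce the naive Bayesian example: P(X|Ci) * P(Ci) for both classes.
from math import prod

p_x_yes = prod([2/9, 4/9, 6/9, 6/9])   # P(X | buys_computer = "yes")
p_x_no  = prod([3/5, 2/5, 1/5, 2/5])   # P(X | buys_computer = "no")
print(round(p_x_yes, 3), round(p_x_no, 3))       # 0.044 0.019

score_yes = p_x_yes * 9/14              # multiply by the prior P(Ci)
score_no  = p_x_no  * 5/14
print(round(score_yes, 3), round(score_no, 3))   # 0.028 0.007
# score_yes > score_no, so X is assigned to "buys_computer = yes".
```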

Naïve Bayesian Classifier: Comments
- Advantages: easy to implement; good results obtained in most of the cases.
- Disadvantages: the class-conditional-independence assumption causes a loss of accuracy, because in practice dependencies exist among variables. E.g., in hospital patient data: profile attributes (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.) are interdependent; such dependencies cannot be modeled by a naïve Bayesian classifier.
- How to deal with these dependencies? Bayesian belief networks.

Bayesian Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent.
- It is a graphical model of causal relationships: it represents dependency among the variables and gives a specification of the joint probability distribution.
- Nodes: random variables. Links: dependency. Example: if X and Y are the parents of Z, and Y is the parent of P, there is no direct dependency between Z and P. The graph has no loops or cycles.

Bayesian Belief Network: An Example
- Variables: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea. LungCancer's parents are FamilyHistory (FH) and Smoker (S).
- The conditional probability table (CPT) for the variable LungCancer (LC) shows the conditional probability for each possible combination of its parents:

        (FH,S)  (FH,~S)  (~FH,S)  (~FH,~S)
  LC     0.8     0.5      0.7      0.1
  ~LC    0.2     0.5      0.3      0.9

- The network encodes the joint distribution as P(z_1, ..., z_n) = prod_i P(z_i | Parents(z_i)); a usage sketch follows.
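A minimal sketch of how the CPT is used: look up P(LungCancer | FamilyHistory, Smoker) and combine it with parent probabilities via the factored joint distribution. The parent priors are invented for illustration, and the sketch assumes FamilyHistory and Smoker have no edge between them.

```python
# CPT for LungCancer from the example, indexed by (FamilyHistory, Smoker).
# Values are P(LC = yes | parents); P(LC = no | parents) is the complement.
cpt_lc = {(True, True): 0.8, (True, False): 0.5,
          (False, True): 0.7, (False, False): 0.1}

# Hypothetical parent priors, invented for illustration only.
p_fh, p_s = 0.1, 0.3

# Joint probability from the factored form P(FH, S, LC) =
# P(FH) * P(S) * P(LC | FH, S), assuming FH and S are not linked.
def joint(fh, s, lc):
    p = (p_fh if fh else 1 - p_fh) * (p_s if s else 1 - p_s)
    p_lc = cpt_lc[(fh, s)]
    return p * (p_lc if lc else 1 - p_lc)

# Marginal P(LC = yes), summing over both parent settings:
p_lc_yes = sum(joint(fh, s, True) for fh in (True, False) for s in (True, False))
print(round(p_lc_yes, 4))
```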

Learning Bayesian Networks
Several cases:
- Given both the network structure and all variables observable: learn only the CPTs.
- Network structure known, some hidden variables: use a method of gradient descent, analogous to neural-network learning.
- Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology.
- Unknown structure, all hidden variables: no good algorithms are known for this purpose.
Reference: D. Heckerman, "Bayesian networks for data mining."

Classification
- Classification predicts categorical class labels.
- Typical applications: {credit history, salary} -> credit approval (Yes/No); {Temp, Humidity} -> Rain (Yes/No).
- Mathematically: learn a function f: X -> Y that maps each feature vector to a class label.

Linear Classification
- Binary classification problem.
- [Figure: a scatter plot with a separating red line; the data above the line belongs to class 'x', the data below the line belongs to class 'o'.]
- Examples of such classifiers: SVM, perceptron, probabilistic classifiers.

Discriminative Classifiers
Advantages:
- prediction accuracy is generally high (compared with Bayesian methods, in general)
- robust: works when training examples contain errors
- fast evaluation of the learned target function (Bayesian networks are normally slow)
Criticism:
- long training time
- difficult to understand the learned function (weights), whereas Bayesian networks can be used easily for pattern discovery
- not easy to incorporate domain knowledge, which is easy in the Bayesian setting in the form of priors on the data or distributions

Neural Networks
- Analogy to biological systems (indeed, a great example of a good learning system).
- Massive parallelism allows for computational efficiency.
- The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change the weights to learn to produce these outputs, using the perceptron learning rule.

A Neuron
- The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping (a code sketch follows).
- [Figure: inputs x_0, ..., x_n with weights w_0, ..., w_n feed a weighted sum; a bias mu_k is subtracted and an activation function f produces the output:]

  y = f(sum_{i=0}^{n} w_i x_i - mu_k)

Multi-Layer Perceptron
- [Figure: an input vector x_i enters the input nodes; weights w_ij connect them to hidden nodes and then to output nodes, producing the output vector.]
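A minimal sketch of the neuron's computation y = f(sum w_i x_i - mu_k), using a sigmoid as the nonlinear activation; the weights, inputs, and bias are invented for illustration.

```python
# One neuron: scalar product of weights and inputs, bias, nonlinear activation.
from math import exp

def neuron(x, w, mu):
    net = sum(wi * xi for wi, xi in zip(w, x)) - mu   # weighted sum minus bias
    return 1 / (1 + exp(-net))                        # sigmoid activation f

# Weights, inputs, and bias below are invented for illustration.
print(neuron(x=[0.5, 1.0, -1.0], w=[0.2, 0.4, 0.1], mu=0.3))
```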

Network Training
- The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classified correctly.
- Steps (see the sketch after this list):
  1. Initialize the weights with random values.
  2. Feed the input tuples into the network one by one.
  3. For each unit, compute the net input to the unit as a linear combination of all the inputs to the unit.
  4. Compute the output value using the activation function.
  5. Compute the error.
  6. Update the weights and the bias.
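The steps above map almost line-for-line onto a single-unit training loop. This is a minimal sketch with one sigmoid unit and gradient-style updates; the learning rate, toy data, and epoch count are invented, and a real multi-layer network would also backpropagate the error through its hidden layers.

```python
# Single sigmoid unit trained with the steps listed above; a full MLP would
# backpropagate the error through its hidden layers as well.
import random
from math import exp

random.seed(0)
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]  # toy OR-like tuples
w = [random.uniform(-0.5, 0.5) for _ in range(2)]   # 1. initialize weights randomly
bias = random.uniform(-0.5, 0.5)
rate = 0.5                                          # learning rate (invented)

for epoch in range(1000):
    for x, target in data:                          # 2. feed tuples one by one
        net = sum(wi * xi for wi, xi in zip(w, x)) + bias   # 3. net input
        out = 1 / (1 + exp(-net))                   # 4. output via activation
        err = (target - out) * out * (1 - out)      # 5. error (sigmoid derivative)
        w = [wi + rate * err * xi for wi, xi in zip(w, x)]  # 6. update weights
        bias += rate * err                          #    ... and the bias

print([round(wi, 2) for wi in w], round(bias, 2))
```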

Network Pruning and Rule Extraction
Network pruning:
- A fully connected network will be hard to articulate: N input nodes, h hidden nodes, and m output nodes lead to h(m + N) weights.
- Pruning: remove some of the links without affecting the classification accuracy of the network.
Extracting rules from a trained network:
- Discretize activation values; replace individual activation values by the cluster average, maintaining the network accuracy.
- Enumerate the outputs from the discretized activation values to find rules between activation values and output.
- Find the relationship between the inputs and activation values.
- Combine the above two steps to obtain rules relating the output to the input.
