From Knowledge Mining to Business Intelligence Applications (Ben-Chang Shia, 谢邦昌)

Format: PPTX, 83 slides. Uploaded 2022-07-15.

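From Knowledge Mining to Business Intelligence: Advanced Statistics Application

Dr. Ben-Chang Shia (谢邦昌)
Chair Professor and doctoral supervisor, Xiamen University
Chair Professor and doctoral supervisor, Capital University of Economics and Business
Chair Professor and doctoral supervisor, Central University of Finance and Economics
Chair Professor, Southwestern University of Finance and Economics
Adjunct Professor, Renmin University of China
Professor, Department of Statistics and Information Science and Graduate Institute of Applied Statistics, Fu Jen Catholic University
President, Chinese Data Mining Society (中华资料采矿协会)

Outline
- Knowledge mining (integrating data mining and text mining) and the development of business intelligence
- Knowledge mining procedures, steps, outputs, and applications
- How to integrate data mining with text mining
- Commentary on the technical development of knowledge mining

The value of preserving knowledge
Decrease: cycle time, response time, duplicated investment, operating costs, meeting time, outside consultants, and so on.
Increase: productivity and quality, transfer of corporate knowledge, fast and effective decisions, course innovation, collective effort, and so on.

Why is knowledge so urgent?
- Investment in knowledge assets
- Downsizing and retirement
- Staff turnover
- Capability
- Duplicated effort
- Too many meetings
- Communication problems
- Organizational goals
- Feasibility
- Speed
- Informality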

“The chief economic priority for developed countries is to raise the productivity of knowledge. The country that does this first will dominate the twenty-first century economically.”
Peter F. Drucker

From data to knowledge: the formation process
[Figure. Stages: Raw Data, Data Warehouse, Target Data, Preprocessed Data, Transformed Data, Pattern, Knowledge, Understanding. Steps: Integration, Selection/cleansing, Preprocessing, Transformation, Data Mining, Interpretation/Evaluation.]

BI architecture
[Figure. Data Sources: Operational DBs, Other sources. Monitor & Integrator: Extract, Transform, Load, Refresh. Complete Data Warehouse, metadata, Data Marts. OLAP Server (Serve). Tools: 1. Comprehensive Performance Management, 2. Analysis, 3. Query, 4. Reports, 5. Data mining. Business Intelligence.]

Data mining
- Classification: categorize your customers or clients
- Prediction: forecast future sales or usage
- Segmentation: group similar customers or clients
- Association: discover products that are purchased together
- Sequence: find patterns and trends over time

Gaining market intelligence from news feeds (Sreekumar Sukumaran and Ashish Sureka)

Integrated BI Systems (Sreekumar Sukumaran and Ashish Sureka)
[Figure. Structural Data: DBMS, File System, XML, EA, Legacy. Unstructured Data: CMS, Scanned Documents, Email. ETL; Text tagger & Annotator; Intermediate Data: RDBMS, XML. Complete Data Warehouse.]

Knowledge sources and value
“On average, professional users spend 11 hours per week looking for information. Seventy-one percent said they could not find what they were looking for.”
Information Management Software, Lazard Freres & Co. LLC, February 2001
“The volume of digitized information will double every year from 2000 to 2005 (an increase to 30 times today's volume).”
Knowledge Management vs. Information Management, Gartner Group, September 2000

Sources: web information, news reports, patents, e-mail, documents, literature, questions.
Publication statistics: 8 TB (books), 25 TB (news), 20 TB (magazines), 2 TB (journals).
Scientific knowledge grows by an average of 2,000 pages per minute.
Reading the new material would take 5 years (at 24 hours a day).
How Can I Keep Up With the Literature?

Evolution
“To study history one must know in advance that one is attempting something fundamentally impossible, yet necessary and highly important.”
Father Jacobus (Hesse's Magister Ludi), Das Glasperlenspiel (The Glass Bead Game)

Document knowledge discovery and management technologies
[Figure. Techniques: retrieval, document filtering, classification, summarization, clustering, natural-language content analysis, extraction, mining, visualization, extraction applications, mining applications. Progression: information access, knowledge cognition, information structure, knowledge generation.]

[Figure. Raw text; Tokenized text; Stemming & Stop words; Term Weighting; Term similarity; Doc similarity; Vector centroid; clustering; classification; Sentence selection; summarization; META-DATA/ANNOTATION.]
Term weighting matrix (documents d1..dm by terms t1..tn):
        t1    t2    ...   tn
  d1    w11   w12   ...   w1n
  d2    w21   w22   ...   w2n
  ...
  dm    wm1   wm2   ...   wmn

Text ETL to Mining
Unstructured data (original data):
  Call Taker: James
  Date: Aug. 30, 2002
  Duration: 10 min.
  CustomerID: ADC00123
  Q: cust sys has stopped working.
  A: checked cust bios and it need updated.
Structured data (meta data):
  Call Taker   James
  Date         2002/08/30
  Duration     10 min.
  CustomerID   ADC00123
Linguistic analysis output:
  Noun: Customer, Software, BIOS
  Subj. ... Verb: customer system ... stop
  SW. ... Problem: BIOS ... need
Pipeline: Linguistic Analysis (Tagging, Dependency Analysis, Named Entity Extraction, Intention Analysis), using a Category Dictionary and a Synonym Dictionary, produces Category/Item pairs for Mining and for Visualization & Interactive Mining.
IBM TAKMI (Nasukawa, Nagano, 1999)
Mining target: individual text
Mining unit: texts; category-labeled items extracted from text using NLP

Text is Tough
- Meaning is an abstract concept that is extremely hard to express (AI-complete)
- It is an endless combination of abstract, complex relationships among many concepts
- One noun can stand for many different concepts: CELL, IV
- Similar concepts can be expressed in many ways (aliases): space ship, flying saucer, UFO, figment of imagination
- Concepts are hard to visualize
- High dimensionality: the analysis can involve hundreds or thousands of dimensions

Text Mining is Easy
- Text is highly redundant
- A few simple algorithms give fairly good results on quite crude tasks:
  - finding important phrases
  - finding meaningfully related words
  - building a summary from an article
- Main problem: evaluating the results
- Goals and objectives must be defined

Traditional IR-based Extraction
[Figure. Doc vector 1 ... doc vector n are scored against a profile vector; if score >= threshold the doc is accepted (yes), otherwise rejected (no). Judgments on accepted docs feed vector learning and threshold learning, guided by a utility function. Other labels: Ontology, Vector initialization, Threshold initialization, Reuse retrieval algorithms, New threshold algorithms, Text-DB, Lexicons.]

Luhn's ideas
“It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements.”

Information extraction
  -Job2
  JobTitle: Ice Cream Guru
  Employer:
  JobCategory: Travel/Hospitality
  JobFunction: Food Services
  JobLocation: Upper Midwest
  Contact Phone: 800-488-2611
  DateExtracted: January 8, 2001
  Source:
  OtherCompanyJobs: -Job1

Information Extraction
Given:
- Source of textual documents
- Well defined limited query (text based)
Find:
- Sentences with relevant information
- Extract the relevant information and ignore non-relevant information (important!)
- Link related information and output in a predetermined format

Example job posting:
Advisory Programmer - Oracle (Austin, TX)
Response Code: 1008-0074-97-iexc-jcn
Responsibilities: This is an exciting opportunity with Siemens Wireless Terminals; a start-up venture fully capitalized by a Global Leader in Advanced Technologies. Qualified candidates will: Responsible for assisting with requirements definition, analysis, design and implementation that meet objectives, codes difficult and sophisticated routines. Develops project plans, schedules and cost data. Develop test plans and implement physical design of databases. Develop shell scripts for administrative and background tasks, stored procedures and triggers. Using Oracle's Designer 2000, assist with Data Model maintenance and assist with applications development using Oracle Forms.
Qualifications: BSCS, BSMIS or closely related field or related equivalent knowledge normally obtained through technical education programs. 5-8 years of professional experience in development, system design analysis, programming, installation using Oracle development

Automatic Pattern-Learning Systems
Pros:
- Portable across domains
- Tend to have broad coverage
- Robust in the face of degraded input
- Automatically find appropriate statistical patterns
- System knowledge not needed by those who supply the domain knowledge
Cons:
- Annotated training data, and lots of it, is needed
- Isn't necessarily better or cheaper than a hand-built solution
Examples: Riloff et al., AutoSlog;

Soderland, WHISK (UMass); Mooney et al., Rapier (UTexas); Ciravegna (Sheffield)
Learn lexico-syntactic patterns from templates
[Figure. Trainer: Language Input and Answers produce a Model. Decoder: Language Input and the Model produce Answers.]

Text Analysis Spectrum
[Figure. Entity Extraction; Targeted Facts and Events; Classification; Clustering; Concept Identification. The spectrum runs from "What is this document about?" to "Who did what to whom, when, where, etc."]

Why is getting dimensional data so hard?
Example: "Hank bought plastic explosives from Henry in Tucson yesterday."
Named Entity Extraction: People, Weapons, Vehicles, Dates
NER Engine output: Hank, Henry, Plastic explosives, Tucson, 11/01/07
FrameNet

Name Extraction via HMMs
[Figure. Text, or Speech passed through Speech Recognition, goes to the Extractor, which uses NE Models to output Entities: Locations, Persons, Organizations. A Training Program builds the NE Models from training sentences and answers.]
Training sentence:
"The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."
Entities marked in the answer: U.N.; Bosnia; Michael Rose; Pale; Sarajevo; Radovan Karadzic.

An easy but successful HMM application:
- Prior to 1997: no learning approach competitive with hand-built rule systems
- Since 1997: statistical approaches (BBN (Bikel et al. 1997), NYU, MITRE, CMU/JustSystems) achieve state-of-the-art performance

NER database mining workflow
[Figure. A document collection; frequent term sets such as {sun, beach} and {surf, fun}; each frequent term set defines a candidate cluster (C1 shown).]
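The frequent-term-set figure here can be made concrete with a small sketch. The idea illustrated is that every term set shared by enough documents defines a candidate cluster (its cover), and the term set itself serves as the cluster description. The document IDs, their terms, and the `min_support` and `max_size` parameters are all assumptions made up for illustration; choosing among overlapping candidates is not shown.

```python
from itertools import combinations

# Hypothetical toy collection: document id -> set of terms.
DOCS = {
    "d1": {"sun", "beach", "fun"},
    "d2": {"sun", "beach", "surf"},
    "d3": {"surf", "fun"},
    "d4": {"sun", "beach"},
    "d5": {"surf", "fun", "beach"},
}


def frequent_term_sets(docs, min_support=2, max_size=2):
    """Map each frequent term set to its cover (the documents containing all its terms)."""
    vocab = sorted(set().union(*docs.values()))
    result = {}
    for size in range(1, max_size + 1):
        for terms in combinations(vocab, size):
            cover = {doc_id for doc_id, doc_terms in docs.items()
                     if set(terms) <= doc_terms}
            if len(cover) >= min_support:
                result[frozenset(terms)] = cover
    return result


candidates = frequent_term_sets(DOCS)
for terms, cover in sorted(candidates.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(terms), "->", sorted(cover))
```

A brute-force scan over term combinations is only workable for a tiny vocabulary; it is used here because it keeps the relationship between a term set and its cover easy to see.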

[Figure. Clusters C1 to C5. Clustering: C1, C2, C4, C5. Clustering description: surf, sun, beach, fun. Additional label: Anopheles.]

Feedback as Model Interpolation
[Figure. Concept C and Document D are compared by the divergence $D(\theta_Q \,\|\, \theta_D)$ to produce Results. Feedback Docs $F = \{d_1, d_2, \dots, d_n\}$ yield a feedback model $\theta_F$, estimated with a generative model or by divergence minimization.]
$\theta_Q' = (1-\alpha)\,\theta_Q + \alpha\,\theta_F$
$\alpha = 0$: $\theta_Q' = \theta_Q$ (no feedback)
$\alpha = 1$: $\theta_Q' = \theta_F$ (full feedback)

Heterogeneous data
[Figure. TDR. Thousands upon thousands of historical records; massive-scale analysis; document clustering; 1000 solutions; case base.]

Mooter
Scientific American (Chinese edition), March issue
Document data clustering

Annotation and Tagging
"On November 16, 2005, IBM announced it had acquired Collation, a privately held company based in Redwood City, California for undisclosed amount."
Annotations: Date; Acquiring Organization; Acquisition Event; Acquired Organization; Place; Amount.
Text Annotator output (to RDBMS, or as XML output):
  Date      Organization   Place              Amount
  Nov. 16   IBM            Redwood City, CA   Undisclosed

Linguistic Concept Extraction from Customer Service Records
- Bag of "Words" extraction: Cstmr, ID, Customer, Yellow, Inc, Happy, Not, Switch, Cell, Phone
- Expressions extraction: Cstmr ID, Customer, Yellow Inc, switch, Cell Phone, Not happy
- Named Entities extraction: Customer (CRM term), Cstmr (?), Yellow Inc (Telco Company), Cell Phone (Telco Term), Not happy, Switch
- Events/Sentiment extraction: Customer (cstmr), cell phone, unhappy (Negative); Switch to (Negative Predicate) yellow inc (Competition)
- Combined with structured data
- Decision Making: Churner, Special Offer

Knowledge Inference; Information Extraction; Information Retrieval

Extracting Information From Text
Structuring knowledge from text: tagging, compounds, grammatical analysis, ontological interpretation, regular expressions, pattern recognition
[Figure. Text to Database via minimal recursion semantics representations. Deep Thought EU project.]

Knowledge Construction
Want to extract prominent concepts/relations from text: tagging, compounds, NP recognition, term frequencies, stopwords, language identification
Brasethvik & Gulla, DKE, 38/1, 2001
[Figure. Domain doc. coll. to Ontology via statistical & linguistic analyses.]

Patterns Construction
[Figure. Sites: Taipei, Tokyo, New York. Repository; Tagging & annotation; CDW; Knowledge Repository; or structured data; Patterns; Patterns Explorer. Example concept graph: Products, Desktop computer, Laptop computers, Hard disk, Hard disk size 40 GB, Operating System, Windows XP, Linux, Macintosh, Web Browser; relations "is a", "crashes", "Installed from http:/."]

Metadata for people, events, times, places, and objects
[Figure. People participate in events; properties; Conceptual Objects; Physical Entities; Temporal Entities; applications. Relations: affect or / refer to; refer to / refine; refer to / identifie; location; at; within. Place; time. Resource index: people, events, objects. Derived knowledge data (e.g. RDF); Thesauri extent; CRM entities; Ontology expansion; Sources and metadata (XML/RDF); Background knowledge / Authorities; CIDOC CRM or DC.]

Concept Lattice
Given the context (D1, T1, R), where D1 = {d1, d2, d3, d4} and T1 = {t1, t2, t3, t4, t5, t6}:
  R     t1  t2  t3  t4  t5  t6
  d1    1   0   1   0   1   1
  d2    1   0   1   0   1   1
  d3    0   1   0   1   0   0
  d4    1   0   0   1   0   1
Table: the input relation R = documents x keywords
Formal concepts (Hasse diagram):
  C1: (D1, {})
  C2: ({d1, d2, d4}, {t1, t6})
  C3: ({d3, d4}, {t4})
  C4: ({d1, d2}, {t1, t3, t5, t6})
  C5: ({d4}, {t1, t4, t6})
  C6: ({d3}, {t2, t4})
  C7: ({}, T1)
The formal concept C4 has two own terms, t3 and t5, and two inherited terms, t1 and t6.

CIDOC CRM example
[Figure. E31 Document "Yalta Agreement"; E7 Activity "Crimea Conference"; E65 Creation Event *; E38 Image; E39 Actor (three instances); E53 Place 7012124; E52 Time-Span February 1945; E52 Time-Span 11-2-1945. Properties: P14 performed; P11 participated in; P94 has created; P86 falls within; P7 took place at; P67 is referred to by; P81 ongoing throughout; P82 at some time within.]
Explicit Events, Object Identity, Symmetry

Rules Extraction
The formal concept C4 makes the following rules possible:
  R1: t3 -> t1, t6
  R2: t5 -> t1, t6

40、R3:t3 t5The interpretation of the R1 and R2:The use of terms t3 or t5 is always associated with that of terms t1 and t6The rule R3 express mutual equivalence of the terms t3,t5:All the documents which have the term t3 also have the t5 term.文献文献知识群组专家与决策知识呈现实时性分群Real-time IndexMetadata ofSearching Re

41、sults公文性资料中低收入户补助因果图-失依儿童各县市福利,信托基金的成立所在各县市失依儿童状态各县市政府,社会局等介入 对单亲家庭的补助之灾后重建及经费相关使用灾后重建基金规则Clustering范例很适合用机洗香味好闻去污力强洗衣省力气味清香能去除99种污渍洗得特别干净香味好闻白袜子洗得最干净气味很香不伤手能够很好的去除污渍衣服不易褪色洗衣不费力能去除99种污渍用量少洗得干净对皮肤刺激少洗各种污渍都很干净洗得干净价格适当洗衣服的效果较好气味不错一直使用该品牌洗好的衣物更白气味好闻广告印象深洗得干净易漂清不太伤手洗得干净用量少洗得干净用量比别的牌子少广告大洗得干净用量少质量好用量少洗得干净

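good packaging; plenty of appealing advertising; pleasant scent; washes clean and white; good publicity, entertaining ads; many people say it is good

Knowledge context; knowledge map; event tracking; information retrieval; knowledge concepts

Kuhn's Descriptive Project
Immature Science -> Normal Science -> Anomalies -> Crisis -> Revolution

Tasks in News Detection
[Figure. News Feeds; Segmentation; Detection (On-Line, Retro); Tracking; Might be Relevant.]
Example event record:
  Location: Aden, Yemen
  Date: October 12, 2000, 11:18 am (UTC+3)
  Attack type: suicide bombing
  Deaths: 19 (including the 2 perpetrators)
  Injured: 39
  Perpetrator(s): al-Qaeda, carried out by Ibrahim al-Thawr and Abdullah al-Misawa

The 9/11 attacks were preventable
- FBI agents in Minnesota: Zacarias Moussaoui's personal computer
- The FBI Phoenix memo (George Will)
- Dr. Bhandari (Virtual Gold, Inc): data mining could have prevented the 9/11 tragedy
- The 9/11 terrorist network

The Red Army Faction threat
- Horst Herold, head of Germany's federal police, built a data mining information network at Germany's Bundeskriminalamt in 1972
- Data sources: housing sales, energy companies
- Success: Rolf Heissler (RAF member)
- Outcome: Herold was reported to have violated human rights and retired; the criminal statutes were amended in 1986
- Three of the 9/11 pilots came from Hamburg

Epidemic alert and reporting system
The World Health Organization set up its Epidemic Alert and Response system many years ago. Because some countries may play down reports of an outbreak out of concern for its economic impact, the system includes software that collects relevant material from media websites in each country, and twenty experts analyze the information in that material.

HighWire.stanford.edu

Information and knowledge
- Amazon: digital camera sales
- News events: The Washington Times
- US National Institutes of Health (NIH): hot research topics. Proposals by Funding/Date across IRGs and Activity Types
- Clinical practice guidelines: Athena/EON (Stanford)
- Athena clinical guidelines: R.D. Shankar, et al. 2001
- Hypertension clinical guidelines: Athena Hypertension Guideline, A. Advani, et al. 2003

Disaster-affected households (financial assistance policy)
[Figure. Loans (affected households, temporary housing); Generative; Discriminative; home reconstruction program; financial institution loans; provisional statute for earthquake reconstruction; affected households; housing; interest; damage. Schema labels: object; method; Object: attribute; Object: condition; Object: Attribute (condition); Specify; Generalize.]

Integrating Distributed Knowledge
- Adaptive knowledge infrastructure is in place
- Knowledge resources identified and shared appropriately
- Timely knowledge gets to the right person to make decisions
- Intelligent tools for authoring through archiving
- Cohesive knowledge development between JPL, its partners, and customers
- Instrument design is semi-automatic based on knowledge repositories
- Mission software auto-instantiates based on unique mission parameters
- KM principles are part of Lab culture and supported by layered COTS products
- Remote data management allows spacecraft to self-command
- Knowledge gathered anyplace from hand-held devices using standard formats on interplanetary Internet
- Expert systems on spacecraft analyze and upload data
- Autonomous agents operate across existing sensor and telemetry products
- Industry and academia supply spacecraft parts based on collaborative designs derived from JPL's knowledge system

Capturing Knowledge
Enables capture of knowledge at the point of origin, human or robotic, without invasive technology

Sharing Knowledge
Enables sharing of essential knowledge to complete Agency tasks

Enables seamless integration of systems throughout the world and with robotic spacecraft

Modeling Expert Knowledge
- Systems model experts' patterns and behaviors to gather knowledge implicitly
- Seamless knowledge exchange with robotic explorers
- Planetary explorers contribute to their successors' design from experience and synthesis
- Knowledge systems collaborate with experts for new research
Enables real-time capture of tacit knowledge from experts on Earth and in permanent outposts

Missions: MarsNet; Europa Orbiter; Space Interferometry Mission; Europa Lander/Submersible; Titan Organics: Lander/Aerobot; Neptune Orbiter/Triton Observer; Mars robotic outposts; Comet Nucleus Sample Return; Saturn Ring Observer; Terrestrial Planet Finder; Interstellar missions; Permanent colonies
Timeline: 2003, 2007, 2010, 2025

The future (NASA)

Decision guidance
- Use the wide-area grid system of the high-speed computing center so that public agencies can upload files for knowledge analysis
- Set operational targets and procedures for distributed, high-speed consolidation of file knowledge
- Build a distributed high-speed environment that supports creating and sharing knowledge, so that every level of government can use it as a reference for decisions
- Automatically link decision issues of different types so that similar ones accumulate
- An automated system analyzes, extracts, and links issues of every kind and integrates them
- Use the linked file knowledge to build an operating system that helps experts of every kind build domain ontologies
- Automatically generate linking indexes between documents and ontologies to modularize decision knowledge
- Integrate case-based reasoning, evidence, and decision guidance into a complete automated reasoning mechanism for decision support
- The system automatically links expert knowledge, based on the ontology and the experience recorded in documents, to help build decision guidance
- Establish a decision evaluation mechanism so that expert decisions can be refined
- Build a simulation mechanism, define situations of each type, and run scenario simulations
- Each unit can rehearse measures over the distributed grid
- From the ontology and all cases, produce a probabilistic inference model for decision execution
- Using inference rules and probabilities, estimate possible decisions and their risks
- Define contingencies and uncertain conditions, estimate outcomes in an uncertain environment, and reason negatively to understand relative risk
- Derive procedures and actions through inference

Decision simulation
Distributed knowledge grid
Policy planning research
- Support comprehensive policy planning; with or without prior experience, plans can be optimized according to planning (OR, Planning) rules
- Compute operating procedures and paths, find the critical path, and choose the best option
- Use the ontology and the distributed grid for high-speed optimal operations-research planning
- Revise planned policies according to actual conditions
- Analyze deviations quickly and propose optimizations in real time

Roadmap: 2005, 2006, 2007, 2008

Development recommendations

Comments welcome
Q&A
End of presentation. Thank you for watching!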
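The "Text ETL to Mining" slide turns a free-text call record into structured fields. A minimal sketch of that first extraction step follows, using regular expressions from the Python standard library. The function name `parse_call_record` and the field pattern are assumptions for illustration, not the TAKMI implementation; the later linguistic analysis (tagging, dependency analysis) is out of scope.

```python
import re
from datetime import datetime

# The call record shown on the slide, kept as free text.
RAW_RECORD = """Call Taker: James
Date: Aug. 30, 2002
Duration: 10 min.
CustomerID: ADC00123
Q: cust sys has stopped working.
A: checked cust bios and it need updated."""

# One "Label: value" pair per line; the label list is taken from the slide.
FIELD_RE = re.compile(
    r"^(Call Taker|Date|Duration|CustomerID|Q|A):\s*(.+)$", re.MULTILINE
)


def parse_call_record(text):
    """Return the labelled fields as a dict, with the date normalized to YYYY/MM/DD."""
    fields = {label: value.strip() for label, value in FIELD_RE.findall(text)}
    # "Aug. 30, 2002" -> "Aug 30, 2002" so that %b can parse the month name.
    parsed = datetime.strptime(fields["Date"].replace(".", ""), "%b %d, %Y")
    fields["Date"] = parsed.strftime("%Y/%m/%d")
    return fields


record = parse_call_record(RAW_RECORD)
print(record["Call Taker"], record["Date"], record["CustomerID"])
```

The free-text `Q` and `A` fields are kept verbatim, since those are the parts that the deck hands on to linguistic analysis.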
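The term weighting matrix from the document-analysis figure (rows d1..dm, columns t1..tn), together with the "Doc similarity" and "Vector centroid" steps, can be sketched as follows. TF-IDF is assumed here as the weighting scheme, since the deck does not name one; the three sample documents and the tiny stop-word list are also illustrative assumptions, and stemming is omitted.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"has", "and", "it", "the", "a"}  # tiny illustrative list

# Hypothetical documents, loosely based on the call record in the deck.
DOCS = [
    "cust sys has stopped working",
    "checked cust bios and it need updated",
    "sys bios updated",
]


def tokenize(text):
    """Lowercase, split into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def tfidf_matrix(docs):
    """Return (vocab, weights) where weights[i][j] = tf(d_i, t_j) * log(N / df(t_j))."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    doc_freq = {t: sum(1 for c in counts if t in c) for t in vocab}
    weights = [[c[t] * math.log(len(docs) / doc_freq[t]) for t in vocab]
               for c in counts]
    return vocab, weights


def cosine(u, v):
    """Cosine similarity between two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a group of document vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


vocab, weights = tfidf_matrix(DOCS)
print(vocab)
print(round(cosine(weights[0], weights[1]), 3))
```

Document similarity computed this way can feed clustering, and a cluster's centroid can serve as its profile for classification, which is how the figure connects these steps.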
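Luhn's proposal quoted in the deck combines word frequency with the position of significant words inside a sentence. The sketch below is a simplified reading of that idea, not Luhn's exact procedure: significant words are the most frequent non-stop words, and a sentence scores (number of significant words) squared divided by the span from its first to its last significant word. The stop-word list, `top_n`, and the single-span simplification are assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "very"}


def split_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def significant_words(text, top_n=5):
    """The top_n most frequent words that are not stop words."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    return {word for word, _ in Counter(words).most_common(top_n)}


def sentence_score(sentence, significant):
    """Significant-word count squared, divided by the span those words cover."""
    words = re.findall(r"[a-z]+", sentence.lower())
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    span = positions[-1] - positions[0] + 1
    return len(positions) ** 2 / span


def summarize(text, n_sentences=1, top_n=5):
    """Keep the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    significant = significant_words(text, top_n)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: sentence_score(sentences[i], significant),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]
```

For instance, with `{"mining", "text"}` as the significant words, "mining text is fun" scores higher than "mining is very hard text", because the same two words sit closer together.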
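The concept lattice and the rules R1 to R3 can be checked directly from the documents-by-keywords table in the deck. The sketch below enumerates formal concepts by closing every subset of documents, which is a brute-force approach chosen for clarity rather than an efficient lattice algorithm; the function names are assumptions.

```python
from itertools import combinations

# The input relation R from the deck: document -> terms marked 1 in its row.
CONTEXT = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
ALL_TERMS = {"t1", "t2", "t3", "t4", "t5", "t6"}


def common_terms(docs):
    """Intent: the terms shared by every document in docs (all terms if docs is empty)."""
    if not docs:
        return set(ALL_TERMS)
    return set.intersection(*(CONTEXT[d] for d in docs))


def docs_having(terms):
    """Extent: the documents that contain every term in terms."""
    return {d for d, doc_terms in CONTEXT.items() if terms <= doc_terms}


def formal_concepts():
    """All (extent, intent) pairs that are closed under the two maps above."""
    concepts = set()
    doc_ids = sorted(CONTEXT)
    for size in range(len(doc_ids) + 1):
        for subset in combinations(doc_ids, size):
            intent = common_terms(set(subset))
            extent = docs_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def implies(premise, conclusion):
    """True if every document containing the premise terms also contains the conclusion terms."""
    return docs_having(premise) <= docs_having(conclusion)


concepts = formal_concepts()
print(len(concepts))
print(implies({"t3"}, {"t1", "t6"}), implies({"t1"}, {"t3"}))
```

On this table the enumeration yields the seven concepts C1 to C7, and `implies` confirms R1, R2, and both directions of R3, while the converse of R1 fails because d4 contains t1 without t3.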
