The-Thinking-Person’s-Guide-to-Data-Warehouse-Design-PresentationPPT优秀课件

上传人:每**** 文档编号:60907496 上传时间:2022-03-09 格式:PPT 页数:46 大小:2.97MB
收藏 版权申诉 举报 下载
The-Thinking-Person’s-Guide-to-Data-Warehouse-Design-PresentationPPT优秀课件_第1页
第1页 / 共46页
The-Thinking-Person’s-Guide-to-Data-Warehouse-Design-PresentationPPT优秀课件_第2页
第2页 / 共46页
The-Thinking-Person’s-Guide-to-Data-Warehouse-Design-PresentationPPT优秀课件_第3页
第3页 / 共46页
资源描述:

《The-Thinking-Person’s-Guide-to-Data-Warehouse-Design-PresentationPPT优秀课件》由会员分享,可在线阅读,更多相关《The-Thinking-Person’s-Guide-to-Data-Warehouse-Design-PresentationPPT优秀课件(46页珍藏版)》请在装配图网上搜索。

1、 2010 Calpont Corporation1The Thinking Persons Guide to Data Warehouse DesignRobin SchumacherVP ProductsCalpont 2010 Calpont Corporation2AgendaBuilding a logical design Transitioning to a physical design Monitoring and tuning the design 2010 Calpont Corporation3Building a logical design 2010 Calpont

2、 Corporation4Why care about design? 2010 Calpont Corporation5What is the key component for success?In other words, what you do with your MySQL Server in terms of physical design, schema design, and performance design will be the biggest factor on whether a BI system hits the mark* Philip Russom, “Ne

3、xt Generation Data Warehouse Platforms”, TDWI, 2009. * 2010 Calpont Corporation6First get/use a modeling tool 2010 Calpont Corporation7The logical design for OLTP 2010 Calpont Corporation8Simple reporting databasesOLTP DatabaseRead Shard OneReporting DatabaseApplication ServersEnd UsersETLJust use t

4、he same design on a different boxReplication 2010 Calpont Corporation9Horror story number one 2010 Calpont Corporation10The logical design for analytics/data warehousing 2010 Calpont Corporation11Logical Design Considerations Datatypes are more generally defined, not directed toward a database engin

5、e Entities arent designed for performance necessarily Redundancy is avoided, but simplicity is still a goal Bottom line: you want to make sure your data is correctly represented and is easily understood (new class of user today) 2010 Calpont Corporation12Manual horizontal partitioningModeling techni

6、que to overcome large data volumes 2010 Calpont Corporation13Manual Vertical PartitioningModeling technique to overcome wide tables/rows 2010 Calpont Corporation14Pros/cons to manual partitioning More tables to manage More referential integrity to manage More indexes to manage Joins oftentimes neede

7、d to accomplish query requests Oftentimes, a redesign is needed because the rows / columns you thought youd be accessing together change; its hard to predict ad-hoc query traffic Less I/O if design holds up Easy to prune obsolete data Possibly less object contentionProsCons 2010 Calpont Corporation1

8、5The bottom line on logical modeling Use a modeling tool to capture your designs Do not utilize a third-normal form design for analytics; keep it simple and understandable Manual partitioning is OK in some cases, but,. Let the database engine do the work for you 2010 Calpont Corporation16Transitioni

9、ng to a physical design 2010 Calpont Corporation17SQL or NoSQL?Row or Column database?How to scale?Should I worry about High availability?Index or no?How should I partition my data?Is sharding a good idea? 2010 Calpont Corporation18General list of top BI database design decisions General architectur

10、e / data orientation Storage engine selection Physical table/Index partitioning Indexing creation and placement Optimizing data loads 2010 Calpont Corporation19Divide & conquer is the best approach Whether you choose to go NoSQL, Shard with normal or special MySQL engines, use MPP storage engines, o

11、r something similar, divide & conquer is your best friend You can scale-up and divide & conquer to a point, but you will hit disk, memory, or other limitations Scaling up and out is the best future proof methodology 2010 Calpont Corporation20Divide & conquer via sharding 2010 Calpont Corporation21Wh

12、at technologies you should be looking at* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009. * 2010 Calpont Corporation22Row or column-based engine? Yes, Row-based tables! Yes, Column-based tables!Will need most columns in a table for queryOnly need subset of columns for queryWil

13、l be doing lots of single inserts/deletesNeed very fast loads; little DMLSmall-medium dataMedium-very large dataKnow exactly what to index; wont changeVery dynamic; query patterns change 2010 Calpont Corporation23Column vs. row orientation A column-oriented architecture looks the same on the surface

14、, but stores data differently than legacy/row-based databases 2010 Calpont Corporation24Example: InfiniDB vs. “Leading” row DBInfiniDB takes up 22% less spaceInfiniDB loaded data 22% fasterInfiniDB total query times were 65% lessInfiniDB average query times were 59% lessNotice not only are the queri

15、es faster, but also more predictable* Tests run on standalone machine: 16 CPU, 16GB RAM, CentOS 5.4 with 2TB of raw data 2010 Calpont Corporation25Why not use both? You can create a hybrid system where you use row-based tables and column-based tables in the same instance and same database Use InnoDB

16、 for OLTP or MyISAM for certain read operations Use column-based tables for analytics, data marts, or warehouses You can scale out with column tables and use row-based tables locally 2010 Calpont Corporation26Why not use both? 2010 Calpont Corporation27MyISAMArchiveMemoryCSV High-speed query/insert

17、engine Non-transactional, table locking Good for data marts, small warehouses Compresses data by up to 80% Fastest for data loads Only allows inserts/selects Good for seldom accessed data Main memory tables Good for small dimension tables B-tree and hash indexes Comma separated values Allows both fl

18、at file access and editing as well as SQL query/DML Allows instantaneous data loadsAlso:Merge for pre-5.1 partitioningMost used DW Storage engines internal to MySQL 2010 Calpont Corporation28What about NoSQL options? Standard model is not relational Typically dont use SQL to access the data Take up

19、more space than column databases Lack special optimizers / features to reduce I/O Really are row-oriented architectures that store data in column families, which are expected to be accessed together (remember logical vertical partitioning?) Individual columns cannot be accessed independently Will be

20、 faster with individual insert and delete operations Will normally be faster with single row requests Will lag in typical analytic / data warehouse use cases 2010 Calpont CorporationPartitioning not if but howmysql CREATE TABLE part_tab - ( c1 int ,c2 varchar(30) ,c3 date ) - PARTITION BY RANGE (yea

21、r(c3) (PARTITION p0 VALUES LESS THAN (1995), - PARTITION p1 VALUES LESS THAN (1996) , PARTITION p2 VALUES LESS THAN (1997) , - PARTITION p3 VALUES LESS THAN (1998) , PARTITION p4 VALUES LESS THAN (1999) , - PARTITION p5 VALUES LESS THAN (2000) , PARTITION p6 VALUES LESS THAN (2001) , - PARTITION p7

22、VALUES LESS THAN (2002) , PARTITION p8 VALUES LESS THAN (2003) , - PARTITION p9 VALUES LESS THAN (2004) , PARTITION p10 VALUES LESS THAN (2010), - PARTITION p11 VALUES LESS THAN MAXVALUE );mysql create table no_part_tab (c1 int,c2 varchar(30),c3 date);* Load 8 million rows of data into each table *m

23、ysql select count(*) from no_part_tab where c3 date 1995-01-01 and c3 select count(*) from part_tab where c3 date 1995-01-01 and c3 delete from t2 where - c3 date 1995-01-01 and c3 alter table t1 drop partition p1;Query OK, 0 rows affected (0.03 sec)Records: 0 Duplicates: 0 Warnings: 0VS.The DROP PA

24、RTITION is A DDL operation, which runs much faster than a DML DELETE. 2010 Calpont Corporation32Index Creation and Placement If query patterns are known and predictable, and data is relatively static, then indexing isnt that difficult If the situation is a very ad-hoc environment, indexing becomes m

25、ore difficult. Must analyze SQL traffic and index the best you can Over-indexing a table that is frequently loaded / refreshed / updated can severely impact load and DML performance. Test dropping and re-creating indexes vs. doing in-place loads and DML. Realize, though, any queries will be impacted

26、 from dropped indexes Index maintenance (rebuilds, etc.) can cause issues in MySQL (locking, etc.) Remember some storage engines dont support normal indexes (Archive, CSV) Remember that a benefit of (most) column databases is that they do not need or use indexes 2010 Calpont Corporation33Optimizing

27、for data loads The two biggest killers of load performance are (1) very wide tables for row-based tables; (2) many indexes on a table; Stating the obvious, LOAD DATA INFILE and the high-speed loaders of column-based engines are the fastest way to load data vs. singleton or array insert statements Co

28、lumn-based tables typically load faster than row-based tables with load utilities, however they will experience slower insert/delete rates than row-based tables Loading data in primary key format helps some engines (e.g. InnoDB). 2010 Calpont Corporation34Optimizing for data loads Move the data as c

29、lose to the database as possible; avoid having applications on remote machines do data manipulations and send data across the wire a row at a time perhaps the worst way to load data Oftentimes good to create staging tables then use procedural language to do data modifications and/or create flat file

30、s for high speed loaders Loading data via time-based order helps some column databases like InfiniDB; logical range partitioning is then possible 2010 Calpont Corporation35Monitoring and tuning the design 2010 Calpont Corporation36Three performance analysis methodsBottleneck analysis Workload analys

31、isRatio analysis 2010 Calpont Corporation37Bottleneck analysis The focus of this methodology is the answer to the question “what am I waiting on?” With MySQL, unfortunately, it can be difficult to determine latency in the database server Lock contention rarely an issue in data warehouses New MySQL p

32、erformance schema has a ways to go in my opinion to be truly useful for bottleneck analysis Problems found in bottleneck analysis translate into better lock handling in the app, partitioning improvements, better indexing, or storage engine replacement 2010 Calpont Corporation38Workload analysis The

33、focus of this methodology is the answer to three questions: (1) Whos logged on?; (2) What are they doing?; (3) How is my machine handing it? Monitor active and inactive sessions. Keep in mind idle connections do take up resources I/O and hot objects a key area of analysis Key focus should be on SQL

34、statement monitoring and collection; something that goes beyond standard pre-production EXPLAIN analysis 2010 Calpont Corporation39Horror story number two 2010 Calpont Corporation40The pain of slow SQL* Philip Russom, “Next Generation Data Warehouse Platforms”, TDWI, 2009. 2010 Calpont Corporation41

35、Workload analysis SQL analysis basically becomes bottleneck analysis, because youre asking where your SQL statement is spending its time Once you have collected and identified your top SQL, the next step is to do tracing and interrogation into each SQL statement to understand its execution Historica

36、l analysis is important too; a query that ran fine with 5 million rows may tank with 50 million or with more concurrent users Design changes usually involve data file striping, indexing, partitioning, or parallel processing additions 2010 Calpont Corporation42Ratio analysis Least useful of all the p

37、erformance analysis methods May be OK to get a general rule of thumb as to how various resources are being used Do not be misled by ratios; for example, a high cache hit ratio is sometimes meaningless. Databases can be brought to their knees by excessive logical I/O Design changes from ratios typica

38、lly include the altering of configuration parameters and sometimes indexing 2010 Calpont Corporation43Conclusions Design is the #1 contributor to the overall performance and availability of a system With MySQL, you have greater flexibility and opportunity than ever before to build well-designed data

39、 warehouses With MySQL, you now have more options and features available than ever before The above translates into you being able to design data warehouses that can be future proofed: they can run as fast as youd like (hopefully) and store as much data as you need (ditto) 2010 Calpont Corporation44

40、For More Information Download InfiniDB Community Edition Download InfiniDB documentation Read InfiniDB technical white papers Read InfiniDB intro articles on MySQL dev zone Visit InfiniDB online forums Trial the InfiniDB Enterprise Edition: http:/ 2010 Calpont Corporation45The Thinking Persons Guide to Data Warehouse DesignRobin S个人观点供参考,欢迎讨论

展开阅读全文
温馨提示:
1: 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
2: 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
3.本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
5. 装配图网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
关于我们 - 网站声明 - 网站地图 - 资源地图 - 友情链接 - 网站客服 - 联系我们

copyright@ 2023-2025  zhuangpeitu.com 装配图网版权所有   联系电话:18123376007

备案号:ICP2024067431-1 川公网安备51140202000466号


本站为文档C2C交易模式,即用户上传的文档直接被用户下载,本站只是中间服务平台,本站所有文档下载所得的收益归上传人(含作者)所有。装配图网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对上载内容本身不做任何修改或编辑。若文档所含内容侵犯了您的版权或隐私,请立即通知装配图网,我们立即给予删除!