Designing Lattice QCD Clusters


Supercomputing '04, November 6-12, 2004, Pittsburgh, PA

Aspects of Performance
Lattice QCD codes require:
- excellent single and double precision floating point performance; the majority of flops are consumed by small complex matrix-vector multiplies (SU(3) algebra)
- high memory bandwidth (the principal bottleneck)
- low latency, high bandwidth communications, typically implemented with MPI or a similar message passing API

Generic Single Node Performance
- MILC is a standard MPI-based lattice QCD code
- "Improved Staggered" is a popular "action" (discretization of the Dirac operator)
- cache size = 512 KB
- the floating point capabilities of the CPU limit in-cache performance
- the memory bus limits performance out of cache

Floating Point Performance (In Cache)
- most flops are SU(3) matrix times vector (complex); see the sketch after this list
- SSE/SSE2/SSE3 can give a significant boost:
  - site-wise (M. Lüscher)
  - fully vectorized (A. Pochinsky); requires a memory layout with 4 consecutive reals, then 4 consecutive imaginaries
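To make the flop and byte counts used later concrete, here is a minimal single-precision sketch of the complex 3x3 matrix times 3-vector kernel, written in plain C with a site-wise complex layout. The type and function names are illustrative, not taken from MILC, and the SSE-vectorized layouts mentioned above are not shown.

```c
/* Minimal sketch of the SU(3) matrix-vector kernel that dominates the flop
 * count: a complex 3x3 matrix times a complex 3-vector, single precision,
 * site-wise complex layout.  Names are illustrative, not MILC's. */
typedef struct { float re, im; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;   /* 9 complex = 72 bytes */
typedef struct { complex_f c[3];    } su3_vector;   /* 3 complex = 24 bytes */

/* c = A * b : 9 complex multiplies (6 flops each) plus 6 complex adds
 * (2 flops each) = 66 flops, reading 72 + 24 = 96 bytes and writing 24. */
void mult_su3_mat_vec(const su3_matrix *A, const su3_vector *b, su3_vector *c)
{
    for (int i = 0; i < 3; i++) {
        float re = 0.0f, im = 0.0f;
        for (int j = 0; j < 3; j++) {
            re += A->e[i][j].re * b->c[j].re - A->e[i][j].im * b->c[j].im;
            im += A->e[i][j].re * b->c[j].im + A->e[i][j].im * b->c[j].re;
        }
        c->c[i].re = re;
        c->c[i].im = im;
    }
}
```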

Memory Performance
Memory bandwidth limits depend on:
- width of the data bus
- (effective) clock speed of the memory bus (FSB)

FSB history:
- pre-1997: Pentium/Pentium Pro, EDO, 66 MHz, 528 MB/sec
- 1998: Pentium II, SDRAM, 100 MHz, 800 MB/sec
- 1999: Pentium III, SDRAM, 133 MHz, 1064 MB/sec
- 2000: Pentium 4, RDRAM, 400 MHz, 3200 MB/sec
- 2003: Pentium 4, DDR400, 800 MHz, 6400 MB/sec
- 2004: Pentium 4, DDR533, 1066 MHz, 8530 MB/sec

Doubling time for peak bandwidth: 1.87 years
Doubling time for achieved bandwidth: 1.71 years (1.49 years if SSE is included)
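As a rough cross-check of the quoted doubling times, the growth rate can be estimated from just the endpoints of the FSB table above. The 1.87-year figure comes from a fit over all of the data points, so the endpoint estimate sketched below is only illustrative.

```c
/* Rough doubling-time estimate from the endpoints of the FSB history above:
 * T_double = years * ln(2) / ln(bw_end / bw_start).  Taking "pre-1997" as
 * 1996, this gives about 2 years, in the same ballpark as the 1.87-year
 * fit quoted over all data points. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double years    = 2004 - 1996;   /* elapsed time between endpoints      */
    double bw_start = 528.0;         /* MB/sec, Pentium/Pentium Pro, EDO    */
    double bw_end   = 8530.0;        /* MB/sec, Pentium 4, DDR533           */
    double t_double = years * log(2.0) / log(bw_end / bw_start);
    printf("peak-bandwidth doubling time ~ %.2f years\n", t_double);
    return 0;
}
```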

Memory Bandwidth Performance - Limits on Matrix-Vector Algebra
From memory bandwidth benchmarks, we can estimate sustained matrix-vector performance in main memory. We use:
- 66 flops per matrix-vector multiply
- 96 input bytes
- 24 output bytes
- MFlop/sec = 66 / (96/read-rate + 24/write-rate), with read-rate and write-rate in MBytes/sec (see the worked example below)

Memory bandwidth severely constrains performance for lattices larger than cache.
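A quick illustration of this formula, using hypothetical sustained read and write rates rather than measured values (the real limits come from the memory bandwidth benchmarks and depend on chipset and SSE usage):

```c
/* Sustained matrix-vector rate implied by the memory system alone:
 * MFlop/sec = 66 / (96/read_rate + 24/write_rate), rates in MB/sec.
 * The rates used in main() are placeholders, not measured values. */
#include <stdio.h>

static double matvec_mflops(double read_rate, double write_rate)
{
    return 66.0 / (96.0 / read_rate + 24.0 / write_rate);
}

int main(void)
{
    /* e.g. a hypothetical DDR400-class system sustaining 4000 MB/sec reads
     * and 2500 MB/sec writes would be limited to roughly 2 GFlop/sec. */
    printf("%.0f MFlop/sec\n", matvec_mflops(4000.0, 2500.0));
    return 0;
}
```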

Memory Bandwidth Performance - Limits on Matrix-Vector Algebra (chart)

Communications - I/O Buses
- low latency and high bandwidth are required
- performance depends on the I/O bus: at least 64-bit, 66 MHz PCI-X is required for LQCD
- PCI Express (PCI-E) is now available:
  - not a bus; rather, one or more bidirectional 2 Gbit/sec/direction (data rate) serial pairs
  - for driver writers, PCI-E looks like PCI
  - server boards now offer X8 (16 Gbps/direction) slots (see the arithmetic check after this list)
  - we've used desktop boards with X16 slots intended for graphics, but Infiniband HCAs work fine in these slots
  - latency is also better than PCI-X
  - strong industry push this year, particularly for graphics (thanks, DOOM 3!)
  - should be cheaper and easier to manufacture than PCI-X
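The slot figures quoted above follow directly from the per-lane data rate: an X8 slot carries 8 lanes at 2 Gbit/sec of data per direction, i.e. 16 Gbit/sec (about 2 GByte/sec) per direction. A trivial check:

```c
/* Per-direction data bandwidth of a PCI Express slot with n lanes,
 * using the 2 Gbit/sec/direction per-lane data rate quoted above. */
#include <stdio.h>

int main(void)
{
    const double lane_gbits = 2.0;              /* data rate per lane, per direction */
    for (int lanes = 1; lanes <= 16; lanes *= 2) {
        double gbits  = lanes * lane_gbits;     /* Gbit/sec per direction  */
        double mbytes = gbits * 1000.0 / 8.0;   /* MByte/sec per direction */
        printf("x%-2d : %4.0f Gbit/sec = %5.0f MB/sec per direction\n",
               lanes, gbits, mbytes);
    }
    return 0;
}
```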

Communications - Fabrics
Existing lattice QCD clusters use either:
- Myrinet
- gigabit ethernet (switched, or a multi-dimensional toroidal mesh)
Quadrics is also a possibility, but historically more expensive. SCI works as well, but has not been adopted.

Emerging (finally) is Infiniband:
- like PCI-E, multiple bidirectional serial pairs
- all host channel adapters offer two independent X4 ports
- rich protocol stacks, now available in open source
- target HCA price of $100 in 2005, less on the motherboard

Performance, measured at Fermilab with the Pallas MPI suite (a ping-pong sketch follows this list):
- Myrinet 2000 (several years old) on PCI-X (E7500 chipset): bidirectional bandwidth 300 MB/sec, latency 11 usec
- Infiniband on PCI-X (E7500 chipset): bidirectional bandwidth 620 MB/sec, latency 7.6 usec
- Infiniband on PCI-E (925X chipset): bidirectional bandwidth 1120 MB/sec, latency 4.3 usec
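The numbers above were measured with the Pallas MPI suite; a stripped-down ping-pong kernel of the kind such benchmarks are built on looks roughly like the sketch below (illustrative only, not the Pallas code):

```c
/* Minimal MPI ping-pong sketch of the kind used to measure latency and
 * bandwidth numbers like those above.  Run with exactly 2 ranks, e.g.:
 *   mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int reps   = 1000;
    const int nbytes = 4096;   /* in the O(1K)-O(10K) region that matters for LQCD */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(nbytes);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    double dt = MPI_Wtime() - t0;

    if (rank == 0) {
        double one_way_us = dt / reps / 2.0 * 1e6;           /* time per one-way message    */
        double mb_per_s   = 2.0 * nbytes * reps / dt / 1e6;  /* effective one-way bandwidth */
        printf("%d-byte messages: %.1f usec one-way, %.0f MB/sec\n",
               nbytes, one_way_us, mb_per_s);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```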

Networks - Myrinet vs Infiniband
Network performance (charts):
- Myrinet 2000 on E7500 motherboards (note: much improved bandwidth and latency on the latest Myrinet hardware)
- Infiniband PCI-X on E7501 motherboards
The important message size region for lattice QCD is O(1K) to O(10K).

Infiniband on PCI-X and PCI-E
Unidirectional bandwidth (MB/sec) vs message size (bytes), measured with the MPI version of NetPIPE (charts):
- PCI-X on E7500: "TopSpin MPI" from OSU, "Mellanox MPI" from NCSA
- PCI-E on 925X: NCSA MPI, with an 8X HCA used in a 16X "graphics" PCI-E slot

Infiniband Protocols
NetPIPE results for PCI-E HCAs using these protocols (chart):
- "rdma_write" = low level (VAPI)
- "MPI" = OSU MPI over VAPI
- "IPoIB" = TCP/IP over Infiniband

TCP/IP over Infiniband
TCP/IP options:
- "IPoIB": full TCP/IP stack in the Linux kernel
- "SDP": a new protocol, AF_SDP instead of AF_INET, which bypasses the kernel TCP/IP stack; socket-based code needs no other changes
The data here were taken with the same binary, using LD_PRELOAD for the SDP run.

Processor Observations
Using the MILC "Improved Staggered" code, we found:
- the new 90nm Intel chips (Pentium 4E, Xeon "Nocona") have lower floating point performance at the same clock speed because of longer instruction latencies, but better performance in main memory (better hardware prefetching?)
- dual Opterons scale at nearly 100%, unlike Xeons, but you must use a NUMA kernel plus libnuma, and alter the code to lock processes to processors and allocate only local memory (see the sketch after this list); single P4E systems are still more cost effective
- PPC970/G5 have superb double precision floating point performance, but memory bandwidth suffers because of the split data bus: 32 bits read-only and 32 bits write-only, while numeric codes read more than they write
- power consumption was very high for the 2003 CPUs (a dual G5 system drew 270 Watts, vs 190 Watts for a dual Xeon); we hear that power consumption is better on the 90nm chips
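For the Opteron point above, here is a minimal sketch of what "use a NUMA kernel plus libnuma, lock processes to processors, and allocate only local memory" can look like in code. It assumes a NUMA-enabled kernel with libnuma installed; the LOCAL_RANK environment variable and the rank-to-node mapping are illustrative assumptions, not part of the original code.

```c
/* Sketch of NUMA pinning for a dual-Opteron node: bind this process to one
 * NUMA node and make subsequent allocations come from that node's local
 * memory.  Requires a NUMA-enabled kernel and libnuma (link with -lnuma).
 * The "local rank -> NUMA node" mapping below is illustrative. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

static void bind_to_node(int node)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return;
    }
    numa_run_on_node(node);   /* restrict this process to CPUs of 'node' */
    numa_set_localalloc();    /* allocate memory on the node we run on   */
}

int main(void)
{
    /* e.g. rank 0 -> node 0, rank 1 -> node 1 on a two-socket Opteron;
     * LOCAL_RANK is a hypothetical environment variable for illustration. */
    const char *s = getenv("LOCAL_RANK");
    int node = s ? atoi(s) % 2 : 0;
    bind_to_node(node);

    /* lattice field allocations made after this point are node-local */
    double *field = malloc(1024 * 1024 * sizeof(double));
    free(field);
    return 0;
}
```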

Performance Trends - Single Node
MILC Improved Staggered code ("Asqtad")
Processors used:
- Pentium Pro, 66 MHz FSB
- Pentium II, 100 MHz FSB
- Pentium III, 100/133 MHz FSB
- P4, 400/533/800 MHz FSB
- Xeon, 400 MHz FSB
- P4E, 800 MHz FSB
Performance range: 48 to 1600 MFlop/sec, measured at 12^4
Doubling times: performance 1.88 years; price/performance 1.19 years!

Performance Trends - Clusters
Clusters based on:
- Pentium II, 100 MHz FSB
- Pentium III, 100 MHz FSB
- Xeon, 400 MHz FSB
- P4E (estimate), 800 MHz FSB
Performance range: 50 to 1200 MFlop/sec/node, measured at a 14^4 local lattice per node
Doubling times: performance 1.22 years; price/performance 1.25 years

Predictions
Latest (June 2004) Fermilab purchase:
- 2.8 GHz P4E, PCI-X, 800 MHz FSB
- Myrinet (reusing the existing fabric)
- $900/node
- 1.2 GFlop/node, based on 1.65 GFlop/sec single node performance (measured: 1.0-1.1 GFlop/node, depending on 2-D or 3-D communications)

Predictions
Late 2004:
- 3.4 GHz P4E, 800 MHz FSB, PCI-Express, Infiniband
- $900 + $1000 (system + network per node)
- 1.4 GFlop/node, based on a faster CPU and a better network

Predictions
Late 2005:
- 4.0 GHz P4E, 1066 MHz FSB, PCI-Express, Infiniband
- $900 + $900 (system + network per node)
- 1.9 GFlop/node, based on a faster CPU and higher memory bandwidth

Predictions
Late 2006:
- 5.0 GHz P4 (or dual core equivalent), 1066 MHz FSB ("fully buffered DIMM" technology), PCI-Express, Infiniband
- $900 + $500 (system + network per node)
- 3.0 GFlop/node, based on a faster CPU, higher memory bandwidth, and a cheaper network

Future Plans
Through SciDAC, DOE is funding lattice QCD centers at Fermilab, Jefferson Lab, and Brookhaven. The plans below are for the Fermilab systems:
- late 2004: 256 Pentium 4E nodes, PCI-E, Infiniband; at least 800 MHz FSB, 1066 MHz if available
- mid-2005: 256 additional P4E nodes (1066 MHz FSB); will expand the Infiniband fabric so that jobs can use as many as 512 processors
- 2006: 1024 nodes on Infiniband; dual core Pentium 4?
- 2007: 1024 nodes; start a yearly refresh, replacing 1/3 of the existing systems; note that network fabrics only need upgrading every two refreshes

For More Information
Fermilab lattice QCD portal:
Fermilab benchmarks:
US lattice QCD portal:
