CloudResearch (Cloud Services)_24

Cloud Computing for e-Science with CARMEN

Paul Watson, Phillip Lord, Frank Gibson, Panayiotis Periorellis, Georgios Pitsilis
School of Computing Science, Newcastle University, Newcastle-upon-Tyne, UK
Paul.Watson@newcastle.ac.uk

Abstract. The CARMEN e-science project (www.carmen.org.uk) is designing a system to allow neuroscientists to share, integrate and analyse data. Globally, over 100,000 neuroscientists are working on the problem of understanding how the brain works. This is a major challenge that could revolutionise biology, medicine and computer science. Solving it requires investigating how the brain encodes, transmits and processes information. In this paper we describe the CARMEN system. This is a generic e-science platform in the cloud which enables data sharing, integration and analysis supported by metadata. An expandable range of services is provided to extract added value from the data. CARMEN is accessed over the Web by neuroinformaticians, who are populating it with content in the form of both data and services. We describe the design of CARMEN and show how it is being used to support neuroscience.

1. Introduction

This paper describes how the CARMEN project is using cloud computing to address challenging requirements from the key scientific domain of neuroscience. Understanding how the brain works is perhaps the major challenge remaining in science, and progress in this area could revolutionise several scientific areas, in particular biology, medicine and computer science. It would help us to understand how the genome controls brain development, how to design better drugs, and how to design computer systems that can carry out tasks such as image recognition which are largely beyond existing artificial computational systems.

Globally, over 100,000 neuroscientists are working on the problem of understanding how the brain encodes, transmits and processes information. The primary material for their research is various types of experimental data including molecular (genomic, proteomic and small molecule), neurophysiological (time-series activity), anatomical (spatial) and behavioural. As techniques and instruments improve, the quantities of data being collected are increasing. For example, single electrode recording (at around 3MB/min) is giving way to multi-electrode recording with currently tens, and soon hundreds, of concurrent signals being
collected. This leads to order of magnitude increases in data collection rate; this is itself likely to be superseded by optical imaging techniques that will increase this by a further factor of 10.

Unfortunately, although data is at the heart of modern neuroscience, and is expensive to collect, it is rarely shared. This is mainly because each instrument manufacturer has their own data format, and so it is unlikely that the analysis tools built by one lab can work on data from another lab. Further, typically each lab describes the experiments and the data they produce in their own informal metadata format. Therefore, even if another lab could read the data, it is unlikely that they would be able to locate data of interest, or understand the context in which it had been collected.

The overall result of this situation is that there are only limited interactions between research centres with complementary expertise, and a severe shortage of analysis tools that can be applied across neuronal systems.

The CARMEN project was set up in 2006 to address these problems. Its aim is to enable the sharing and collaborative exploitation of both data and analysis code so that neuroscience can get much more value from the data that it is collecting.

This paper describes the design of the CARMEN system. This is a generic e-Science platform which enables data sharing, integration and analysis supported by metadata. An expandable range of services is provided to extract value from the data. CARMEN has adopted a “Cloud Computing” approach in which functionality is accessed over the Web by neuroscientists, who are populating it with content in the form of both data and services. In the rest of this paper we describe the design of CARMEN and give an example showing how it is being used to
support neuroscientists.

2. CARMEN Architecture

Cloud Computing is of increasing interest in the computing industry. It is concerned with building systems in which users interact with applications remotely over the internet (typically through a web browser). This approach has several advantages for both application providers and users. It frees the application writer from having to buy and manage their own hardware; instead they can use highly scalable resources in the cloud to meet their needs. Due to the typical commercial “pay-as-you-go” payment regimes, they are only charged for resources as they need them and do not have to worry about over-provisioning (which wastes money on underused hardware) or under-provisioning (which can result in disastrously poor performance for users). For users, having services delivered over the web removes the need to deploy, manage and maintain software on their own resources. With the growth of the mobile internet, it also opens up the possibility of being able to interact with a service from many locations: at work, at home, and while travelling. From the point of view of the resource providers, it allows them to exploit centralised data storage and computation in large data centres which, due to economies of scale, reduces costs and energy consumption.

Cloud computing has a somewhat different emphasis from Grid computing, which has largely focused on integrating heterogeneous resources, often across multiple organisations, where no one organisation has sufficient resources to meet the requirements of particularly challenging applications: “The grid integrates services across distributed, heterogeneous, dynamic virtual organizations formed from the disparate resources within a single enterprise and/or from external resource sharing and service provider relationships in both e-business and e-science.” [1]

Of course, it would be possible to combine resources from more than one Cloud, in which case grid techniques would be of interest, but this is not a current focus.

There are limits to what can be achieved with Cloud Computing; highly interactive tasks requiring graphically rich interfaces may not work well as web applications. As will be seen, CARMEN utilises one such application, the Signal Data Explorer [2], that is deployed on the user's desktop, and so the project is taking the liberal approach of using web-based services where possible, but supporting desktop services where necessary.

The Cloud computing approach was attractive for meeting the CARMEN requirements largely because of the significant amount of data that will be stored and analysed by scientists. Current estimates put this in excess of 100TB by 2010 for the 20 neuroscientists involved in the project, though if video capture of neuronal activity continues to supersede electrode-based recording this may be a serious underestimate. Where there are huge amounts of data to be processed, it is more efficient to move the computation to the data
rather than the other way around [3]. This requires having computational resources closely coupled to the servers holding the data. Cloud computing offers the chance to do this if the cloud is internally engineered with fast networking between the storage and compute servers.

Figure 1. An e-Science Cloud

The basic aim of CARMEN is therefore to provide a cloud (which we name a CAIRN) that neuroscientists interact with through a web-based portal. We have conducted a detailed requirements capture from the scientists in the project. From this, it is clear that the main abilities required by the neuroscientists are:

- to upload experimental data to the CAIRN
- to search for data that meets some criteria (e.g. all data captured under particular experimental conditions)
- to share data in a controlled (user-defined) way with collaborators
- to analyse data. It is not possible to define a closed set of services that will meet all the analysis needs of all scientists (indeed, new algorithms are being investigated all the time). Consequently, there needs to be a way for scientists to add new services.

Existing cloud computing offerings focus on providing low-level compute and data storage services (e.g. Amazon S3 and EC2). It would be possible to build applications to support these neuroscience requirements directly on this low-level platform, but for CARMEN we chose instead to deploy a set of generic e-science services, and then build domain-specific neuroinformatics services and content on top of these (Figure 1).

The selection and design of these services was based on our experiences in a variety of e-science projects carried out since 2001, targeting a wide range of disciplines from bioinformatics, through transport, to artistic performance. Figure 2 shows another view of the collection of e-Science services in the CARMEN Cloud (which we named a CAIRN). These services are now described in turn.

Figure 2. The CARMEN CAIRN (components shown: Web Portal, Rich Clients, Security, Workflow Enactment Engine, Compute Cluster on which Services are Dynamically Deployed, Data, Metadata, Registry, Service Repository)

2.1 Data

In [4] Bowker argued that there are three stages in the “Standard Scientific Model”:

- Data is collected
- Data is analysed and papers are published on the results
- Data is gradually lost

In many scientific disciplines, this occurs because data is kept on individual scientists' machines and is not subject to long-term curation. Often, once the papers are written the data is considered to be of lower value, with the time and cost implications for maintenance falling on the individual; as a result it is often lost. This has several undesirable consequences. Papers often draw conclusions from
data that is not available to others to examine and analyse themselves. Reproducibility is a cornerstone of science, but is impossible if the data is not available. Further, in areas such as neuroscience, data that may be expensive to collect cannot be re-used.

The CARMEN approach to addressing this is to provide ways for users to store, analyse and share data in the CARMEN CAIRN, rather than on their own computers. The CAIRN provides storage for file-based data and structured data. In CARMEN, the primary data is typically sampled voltage signal data collected, for example, from Multi-Electrode Array recording. Experimenters then upload this data into the CAIRN, where it is stored in a filestore. Due to the large volume of data that will be produced by the neuroscience experiments, there is an initial requirement to hold in the region of 100TB of data. We use a Storage Resource Broker (SRB [5]) for this due to its flexibility and scalability. Whilst the primary data is held in file storage, the derived data is stored in a database. This allows researchers to exploit the powerful functionality offered by an RDBMS, especially rich querying to select data of interest.

2.2 Metadata

Metadata is essential for a system such as CARMEN which will hold thousands of data collections; without it, it will be hard to discover or understand the stored data. Therefore, when new data is uploaded, users must specify descriptions of the experimental context and conditions through a forms-based interface. The description of a particular experiment is first defined as a “Minimum Information about a Neuroscience Investigation” (MINI) checklist document, analogous to the MIAPE documents for proteomics [6]. The information defined in the MINI documents is then structured using the existing FuGE [7] standard, which is a data model that represents components of experimental activity.

The SyMBA [8] package, under development in the CISBAN project (http://www.cisban.ac.uk/), is a database implementation of the FuGE schema and is being incorporated within the CARMEN CAIRN to upload, store, query and retrieve metadata. When a user uploads data they are presented with a forms-based interface for annotating data and services with metadata.

It is important to have a scheme to uniquely identify data and an associated mechanism to find the metadata associated with an identifier. There is no one dominant
standard in this area, but we have chosen to adopt LSIDs (Life Science Identifiers [9]). They offer a location-independent identifier and an associated protocol that allows both data and metadata to be accessed. This implements the Data Registry service shown in Figure 2.

2.3 Managing Analysis Services

Once users have uploaded new data into the CAIRN, or used the registry to locate existing data that is of interest, they will want to analyze it. For example, in neuroscience, electrophysiological data may first undergo spike sorting to ascribe the data to specific neurons; next statistical analysis may be applied to work out rates of spike firing; finally, graphs may be generated to visualize the results.

The section on data referred to Bowker's work on gradual loss of scientific data [4]. We argued that this had undesirable consequences, such as loss of re-use and reproducibility, which we are addressing in CARMEN by storing data in the CAIRN rather than on individual scientists' machines. However, exactly the same argument can be made regarding the programs that are used to analyse data. We can give an equivalent three stages for programs:

- Programs are written to analyse data
- Papers are published on the results of the analysis tools
- The programs are gradually lost

These problems occur because programs are deployed on individual scientists' machines and are not subject to long-term maintenance. As with loss of data, this has undesirable consequences. Papers often draw conclusions from data using programs that are not available to others to examine and use themselves; again, reproducibility is impossible if the programs are not available. Further, programs may embody great expertise on the part of the authors, and be developed over many years, but cannot be re-used.

Some e-Science projects such as myGrid have attempted to address this by encouraging authors to “publish” their programs as services (e.g. Web Services) which can be executed remotely by users. This proved largely successful except that it still relies on program owners maintaining the software and the systems on which they run. As a result, scientists would sometimes discover that services that they had come to depend on would suddenly disappear.

CARMEN is addressing this by providing ways for users to store and run programs in the CARMEN CAIRN, rather than on their own computers. The CAIRN will be a repository for the long-term storage and curation of analysis programs as well as data. Programs are packaged by their authors as WS-I conformant Web Services (to give very high levels of interoperability and longevity) so that there is a common way of communicating with, and managing, them. Authors upload their services in a deployable form into the CAIRN where they are stored, and metadata about them is entered into a service registry. This ensures that services are preserved so that computations can be re-run, and services re-used.

There is another compelling reason to run the analysis services
in the CAIRN. As discussed, neuroscience data sets can be TBs in size. Therefore, it would often not be practical to export the required data out of the CAIRN to a client for processing: transfer times could be very high, and many scientists would not have the local resources to manage such large datasets. Instead, having programs run in the CAIRN means that data only has to be transferred within the CAIRN. As the CARMEN CAIRN is realized by a cluster with a high performance internal network, this can be achieved at a high data rate.

The Dynasoar [10] dynamic service deployment infrastructure is used to deploy the services on demand from the repository onto the available compute resources when they are invoked. It achieves this by dividing the handling of the messages sent to a service between two components, a Web Service Provider and a Host Provider, and defining a well-defined interface through which they interact.

- The Web Service Provider accepts the incoming SOAP message sent to the endpoint and forwards it to a Host Provider, along with a pointer to the service repository from which a deployable version of the service can be retrieved.
- The Host Provider's role is to control the computational resources in the CAIRN. It accepts the SOAP message from the Web Service Provider (along with any associated information) and is responsible for processing it and returning a response to the client.

When the message reaches the Host Provider, there are two possibilities, depending on whether or not the service is already deployed on the node on which the message is to be processed. If the service is already deployed on the node then the Host Provider simply routes the message to the service for processing. This case is shown in Figure 3: a request for a service (s5) is sent by the Consumer to the endpoint at the Web Service Provider which passes it on to a Host Provider. The Host Provider already has the service s5 deployed (on nodes 1 and 2 in the diagram) and so, based on the current loading, it chooses to route the request to node 2 for processing. Note that the Web Service Provider is not aware of the internal structure of the Host Provider (e.g. the nodes on which the service is deployed, or the node to which the message is sent); this is managed entirely by the Host Provider.

Figure 3. A request is routed to an existing deployment of the service

Figure 4 shows an example of dynamic service deployment. A request for a service (say s8) is sent by the client to the endpoint at the Web Service Provider which, as before, passes it on to a Host Provider (step 1 in the Figure). As s8 is not deployed on any of the nodes it controls, the Host Provider chooses one node based on loading information (node 2 in this case), fetches the service code from the Web Service Provider and installs the service on that node (step 2). It then routes the request to it for processing (step 3). The response is then routed back to the consumer.

Once a service is installed on a node it remains, ready to process future messages, until the Host Provider decides to reclaim it. This has the potential to be much more efficient than job-based scheduling systems in which each job execution requires the program to be moved and installed.

Figure 4. A service is dynamically deployed to process a request.

To meet increasing demands for a service from clients, the Host Provider can choose to deploy services on multiple nodes and load-balance requests across them.

Dynasoar is agnostic as to the form of the service to be deployed or its internal structure so long as the Host Provider has a deployer for that type of service. CARMEN currently deploys services written in a variety of languages (including MATLAB, Java, C++, R) as well as services encapsulated in a VMware Virtual Machine. The latter case allows support for
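
The two Host Provider cases described for Figures 3 and 4 (route to an existing deployment, or fetch, install and then route) can be illustrated with a small Python sketch. This is not Dynasoar code: the names `Node`, `HostProvider`, `WebServiceProvider`, `handle` and `endpoint` are hypothetical, a service is reduced to a plain callable, the repository to a dictionary, and "loading" to a simple request counter with least-loaded selection assumed as the policy.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: a "service" is modelled as a callable from request to response.
Service = Callable[[str], str]


@dataclass
class Node:
    name: str
    deployed: Dict[str, Service] = field(default_factory=dict)
    load: int = 0  # toy loading metric: requests processed so far


class HostProvider:
    """Hypothetical Host Provider: owns the nodes and decides where to run."""

    def __init__(self, nodes: List[Node], repository: Dict[str, Service]):
        self.nodes = nodes
        self.repository = repository  # stands in for the service repository

    def handle(self, service_id: str, request: str) -> str:
        # Case 1 (Figure 3): the service is already deployed on some nodes.
        candidates = [n for n in self.nodes if service_id in n.deployed]
        if not candidates:
            # Case 2 (Figure 4): choose a node by load, fetch and install.
            target = min(self.nodes, key=lambda n: n.load)
            target.deployed[service_id] = self.repository[service_id]
            candidates = [target]
        # Route to the least-loaded node that has the service installed.
        node = min(candidates, key=lambda n: n.load)
        node.load += 1
        return node.deployed[service_id](request)


class WebServiceProvider:
    """Hypothetical Web Service Provider: forwards requests and knows
    nothing about the nodes behind the Host Provider."""

    def __init__(self, host: HostProvider):
        self.host = host

    def endpoint(self, service_id: str, request: str) -> str:
        return self.host.handle(service_id, request)


demo_nodes = [Node("node1"), Node("node2")]
demo_repo = {"s8": lambda msg: "s8 processed " + msg}
demo_wsp = WebServiceProvider(HostProvider(demo_nodes, demo_repo))
print(demo_wsp.endpoint("s8", "request-1"))  # first call installs s8
print(demo_wsp.endpoint("s8", "request-2"))  # second call reuses it
```

In this sketch the installed service stays on its node between calls, which mirrors the point that a deployed service remains ready for future messages; reclaiming services and deploying to several nodes under heavy demand are left out.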
