Articles

  • Windows 8 client

    One of the Master's theses completed at the Department of Computer Architecture resulted in applications for the Comcute system running on Windows 8 and its previous versions (Vista, 7). These applications are attached below.

    The archive below contains the source code of the application working in the Modern UI interface of Windows 8. The application can be installed directly with the Visual Studio 2012 development tool under a valid developer license, or by generating binary files and installing the app as trusted (the latter option is available only on Windows 8 Enterprise).

    COMCUTE_Client_src.zip

    Below are archives containing the source code of the application working on Windows Vista, 7 and 8 (in desktop mode) and the binary files of the application (for the 32-bit architecture).

    COMCUTE_Client_src.zip

    COMCUTE_Client_W7_bin_x86.zip


  • Mersenne Number Finding and Collatz Hypothesis Verification in the Comcute Grid System

    In this chapter, some mathematical applications are described that were used to test the scalability of the Comcute grid system. In particular, verification of the Collatz hypothesis and the search for Mersenne numbers were applied to prove the scalability and high performance of this grid system. Results were compared with outcomes obtained by other grid systems.


  • Distributed Detection of Selected Features in Data Streams Using Grid-class Systems

    This chapter describes the basic methodology of distributed digital signal processing. A selection of distributed methods for detecting selected features in data streams using grid-class systems is discussed. Problems related to the distribution of data for processing are addressed, and a mitigating method for data distribution and result merging is described.


  • Image Processing Techniques for Distributed Grid Applications

    In this chapter, parallel approaches to 2D and 3D convolution processing of series of images have been presented. A distributed, practically oriented, 2D spatial convolution scheme has been elaborated and extended into the temporal domain. The complexity of the scheme has been determined and analysed with respect to the coefficients in convolution kernels. Possibilities of parallelisation of the convolution operations have been analysed and the results presented. Serial and parallel variants of 2D convolution schemes are proposed and their time-cost trade-offs are discussed. The flexibility of the solution with regard to the scalable size of the kernel has been highlighted. The image processing techniques are analysed with respect to their application in active distributed grid processing systems in the Internet, and their direct orientation toward the Comcute system has been deliberatively spotlighted.


  • Data Partitioning and Task Management in the Clustered Server Layer of the Volunteer-based Computation System

    While a typical volunteer-based distributed computing system [1] focuses on computing performance, the Comcute system [3] was designed especially to stay operational in emergency situations. This means that the designers had to take into account not only performance but also the safety of calculations. A quadruple-layered architecture was proposed to separate the untrusted components from the core of the system. The main layer (W) consists of independent server nodes which are coupled into a cluster. The W-servers provide task promotion among the nodes, data partitioning, and the gathering, comparing and merging of results. The cluster remains operational as long as at least one of the nodes is able to operate. This paper describes the functionality of the Comcute system from the W-node perspective, considering two task parameters: the required performance level and the required level of computational reliability.


  • Genetic Programming for Workload Balancing in the Comcute Grid System

    In this chapter, a genetic programming paradigm is implemented for reliability optimization in the Comcute grid system design. Chromosomes are generated as program functions, and genetic operators are then applied to find Pareto-suboptimal task assignments and schedules. Results are compared with outcomes obtained by an adaptive evolutionary algorithm.


  • On Configurability of Distributed Volunteer-Based Computing in the Comcute System

    The chapter proposes additional solutions that can be implemented within the Comcute system to increase its configurability. This refers to configuration of the reliability level in the W and S server layers, static or on-the-fly data partitioning and integration, configuration of the system for processing in a data streaming fashion, extending the system with selection of the project that a client wants to contribute to, and ease of migration of legacy codes to the system. Finally, an example of a legacy distributed application for monitoring client locations and resource usage is presented, with suggestions on its migration to the Comcute system environment.


  • Genetic Positioning of Fire Stations Utilizing Grid-computing Platform

    The chapter presents a model for determining near-optimal locations of fire stations based on the topography of a given area and the locations of forests, rivers, lakes and other elements of the site. The model is based on the principles of genetic algorithms and utilizes the power of the grid to distribute and execute in parallel the most performance-demanding computations involved in the algorithm.

    1.1. Fire spreading simulation

    The starting point for choosing the locations of fire stations is to understand, and possibly simulate, the way fire spreads over a given terrain. The topography of the land and the placement of forests, rivers and other elements of the site all need to be taken into account to perform an accurate simulation.

    The model used for implementation is based on the spatially explicit representation where the whole terrain is divided into square-shaped fields [1]. Each field of the map is described by [2]:

    • its location – e.g. a pair (x, y) representing coordinates in two-dimensional space,
    • a set of variables representing the field's state – e.g. calorific value, fire intensity value, altitude,
    • a finite set of neighborhood fields – the neighborhood relation is defined as the Moore neighborhood (shown in Fig. 9.1 – gray fields constitute the neighborhood of the black field) and comprises the eight fields surrounding the central one,
    • a transition function which calculates the new state of the field as a function of its present state, the state of its neighbors and the time interval between the present and the to-be-calculated state.
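    As a rough illustration, the field model described above could be captured by a simple Java class; the class layout and member names below are assumptions of this sketch, not the chapter's actual implementation (neighborhood access is implicit through grid indexing).

        // Hypothetical sketch of the field model: location, state variables
        // and the three possible states of a field.
        public class Field {
            enum State { INACTIVE, ACTIVE, BURNED_OUT }

            final int x, y;          // location in two-dimensional space
            double calorificValue;   // fuel left; zero for water, high for forest
            double fireIntensity;    // current intensity of fire on this field
            double altitude;
            State state = State.INACTIVE;

            Field(int x, int y, double calorificValue, double altitude) {
                this.x = x; this.y = y;
                this.calorificValue = calorificValue;
                this.altitude = altitude;
            }
        }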

    The flow of time is simulated as a run of discrete time deltas. During each time step, the state of a field can change according to the transition function. The simulation is performed in subsequent iterations, with each iteration representing one elapsed time delta and depending on the state from the previous step. During every iteration of the simulation, the transition function is called for all fields to calculate their new state.

    Each individual field can be in one of three possible states: inactive (before ignition), active (under fire) or burned out. Transition from the inactive to the active state requires a specific level of fire intensity spread from neighbor fields – the ignition threshold. It helps to better approximate the actual behavior of fire spreading in reality.

    Fig. 9.1. Moore neighborhood [2]

    The calorific value parameter plays an important role during the simulation. Firstly, it determines the ignition threshold. Fields with a higher calorific value have a lower threshold – they are easier to set on fire. Secondly, the calorific value determines how long and how intensely a given field will burn. A field is considered burned out when its calorific value reaches zero, so the higher the value is, the longer the field can burn.

    The calorific value also influences the fire intensity value of a field under fire. The transition function uses saturation arithmetic when calculating the fire intensity, with the maximum value dependent on the original calorific value of the field. Fields with a higher original calorific value burn with greater intensity [1]. The existence of forest on a given field translates to a higher calorific value, while fields occupied by rivers, lakes or seas have a calorific value equal to zero – those fields cannot be set on fire.

    The key part of the simulation is the transition function. Its purpose is to calculate the new state of the field based on data from the previous iteration of the simulation. The new fire intensity is calculated by summing the fire amounts that spread from all neighbors to the considered field. The fire intensity of the field itself from the previous iteration is also taken into account, and the final value is truncated in accordance with the saturation arithmetic.

    The amount of fire spread from neighbor fields is determined by the coefficients of the simulation (possibly independent for each and every neighbor). Those need to be chosen experimentally to suit the desired velocity of fire spreading. The amount of fire spread from different neighbors is also influenced by the wind [1]. Fig. 9.2 shows two examples of coefficient values for neighbor fields.

    a) coefficients for no wind conditions
    b) coefficients for windy conditions (wx=0.05, wy=0.1)

    Fig. 9.2. Example values of simulation coefficients

    Fig. 9.2a shows an example of coefficients in no wind conditions. Coefficients for fields adjacent by edge to the central one are bigger than those for fields adjacent by corner by a factor of √2 (approximately). The motivation for such a distinction is that the Euclidean distances between the centers of the respective fields and the middle of the central field also differ by a factor of √2. Fig. 9.2b shows coefficients influenced by wind. The wind is defined as a pair (wx, wy) representing its horizontal and vertical components. The parameters wx and wy are dimensionless and are added to (or subtracted from) the original coefficients of the fields depending on the direction of spreading.

    The fire intensity of each neighbor is multiplied by its respective coefficient and added to the overall intensity of the field. New values are calculated based on a snapshot of values from the previous iteration. In each iteration, the calorific values of fields are decreased according to the current fire intensity.
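    As an illustration of the update rule just described, the following Java sketch computes a field's new intensity from the previous-iteration snapshot; the 3×3 coefficient array corresponds to Fig. 9.2, while the exact saturation limit and border handling are assumptions of this sketch.

        // New intensity = clamped sum of the field's own previous intensity and
        // the weighted contributions of its Moore neighbors.
        static double newIntensity(Field[][] prev, int x, int y, double[][] coeff) {
            Field f = prev[x][y];
            if (f.calorificValue <= 0) return 0;   // water or burned out: cannot burn
            double sum = f.fireIntensity;
            for (int dx = -1; dx <= 1; dx++) {
                for (int dy = -1; dy <= 1; dy++) {
                    if (dx == 0 && dy == 0) continue;
                    int nx = x + dx, ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= prev.length || ny >= prev[0].length)
                        continue;                  // map border: no neighbor there
                    sum += coeff[dx + 1][dy + 1] * prev[nx][ny].fireIntensity;
                }
            }
            // Saturation arithmetic: the intensity never exceeds a limit derived
            // from the field's calorific value.
            return Math.min(sum, f.calorificValue);
        }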

    1.2. Adding firefighters to the mix

    So far, a model has been defined for the simulation of fire spreading. To determine proper locations of fire stations, the influence of firefighters has to be taken into account as well. Two aspects that require particular attention are the velocity of firefighters and the time required to suppress fire of different intensities on fields within the reach of firefighters. The number and individual behavior of firefighters are not considered in this model. It is assumed that there are always enough firefighters to extinguish the fire.

    Velocity determines how long it will take before firefighters reach the fire. It is assumed that firefighters can move in any direction (excluding fields representing water reservoirs) and that the maximum velocity is calculated based on the terrain slant and the type of site. For example, firefighters move slower in a forest due to the existence of obstacles (in contrast to fire, which moves faster there due to the high calorific value).

    When firefighters reach a field which is under fire, the influence of water (or other extinguishing means) in every following iteration is twofold:

    • it reduces fire intensity, which is the direct effect of extinguishing,
    • it reduces calorific value, so that fields visited by firefighters are less likely to be set on fire again (the longer water is applied to a field, the less likely re-ignition becomes).

    Parameters of the simulation need to be fine-tuned to best reproduce real-world conditions.

    To evaluate how well a set of fire stations is suited for extinguishing fires on a given terrain, a set of test cases has to be executed. Each test case consists of a set of locations pointing out where fire is ignited around the terrain. Every set of fire stations is tested against multiple fires started in different places. This is needed because, in reality, a fire may start in any location. The set of fire stations that successfully handles the most test cases scores the best result. This way, the selected locations are not suited to only one scenario of fire spreading but cover a whole range of possible cases. The result of the evaluation is a single number denoting the overall area that was burned during the simulation of all test cases.
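    A sketch of this evaluation loop is given below; Simulation and TerrainMap are illustrative stand-ins for the module's real types, and java.awt.Point is used only for brevity.

        import java.awt.Point;
        import java.util.List;

        interface Simulation {
            void runUntilExtinguished();
            double burnedArea();
        }

        interface TerrainMap {
            Simulation newSimulation(Point[] stations, Point[] ignitions);
        }

        final class Evaluation {
            // Lower total burned area over all test cases means a better station set.
            static double evaluate(Point[] stations, List<Point[]> testCases, TerrainMap map) {
                double totalBurned = 0;
                for (Point[] ignitions : testCases) {
                    Simulation sim = map.newSimulation(stations, ignitions);
                    sim.runUntilExtinguished();
                    totalBurned += sim.burnedArea();
                }
                return totalBurned;
            }
        }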

    1.3. Choosing best locations

    With a model for testing a single set of fire stations already defined, the remaining issue is how to choose the sets of fire stations for evaluation so as to maximize the chance of finding the optimal (or a near-optimal) solution. The basic approach is to choose sets of stations randomly and test which of them are rated highest. The bigger the input data set is, the greater the chance that one of the randomly chosen sets scores a good result. The problem is that increasing the size of the input data also increases the processing time significantly.

    1.3.1. Grid-level parallelism

    One solution to the issue stated above is to use a distributed, high-performance computing platform to execute the simulations. Individual simulations can be run independently on parallel machines to reduce the overall time of the tests [5]. The fire spreading model described above was implemented as a computational module for the Comcute grid-computing platform. The target platform makes use of processing power donated by volunteers. By distributing computational tasks – e.g. one set of fire stations tested by each participant – great overall performance can be achieved.

    The computational module was implemented as a Java applet which runs in the volunteer's browser. This way, joining the grid does not require the installation of any additional software on the volunteer's computer. Participants donate their processing power with a single click of a link on Comcute's project website.

    The task distribution model of the Comcute platform is especially useful in emergency situations. If there is a huge forest fire, the described computational module can be used to determine locations to which firefighter troops should be sent. In such a sudden situation, many participants can easily attach their machines to the grid ad hoc by simply visiting the project's website. Those can be both volunteers and people obligated to donate the unused processing power of their machines in case of a crisis (e.g. employees of public institutions).

    1.3.2. Volunteer’s machine-level optimizations

    Distributing computations among volunteers connected to the grid can decrease the time needed to perform simulations for the whole input set by a huge factor. However, the time of a single simulation run on a volunteer's machine can be significant if the computational module is not designed with performance in mind.

    Typical sizes of maps used for the simulation range from 1024×1024 to 4096×4096 fields. For the smallest map, usually about 1500 iterations of the simulation are required to reach the point where the fire is extinguished and there is no activity that could change the situation on the map in any significant way. In a naive approach, the computational module would iterate over the whole map in every single iteration. Such an approach would require 1 572 864 000 calls of the transition function in the case of the smallest map. The time required for such a simulation can be up to several, or – in case of slower machines – even over a dozen, minutes. Given the fact that a test of a single set of fire stations consists of several test cases, it can take over an hour to obtain a single result.

    The main problem of the described naive approach is that in each iteration of the simulation many fields are processed unnecessarily. Those include fields that represent water reservoirs and places that the fire has not reached yet. Depending on the map of the terrain, omitting rivers, lakes or seas can greatly improve the time of the simulation.

    Avoiding the processing of fields that are not under fire (and do not have neighbors which are under fire) is especially important at the beginning and at the end of the simulation. In the initial phase, only a small group of fields is under fire and it will take many iterations before the fire covers a larger area. At the end, most fields are already burned or the fire has been suppressed on them, so any further iterations will not change their state. Processing the whole map in those cases is a waste of processing power.

    In a perfect situation, only fields under fire and their neighbors would be processed during each iteration. However, it would not be memory efficient to keep a list of the individual fields involved in the simulation. Moreover, because of the size of such a list, operations like looking for duplicates (a field can have two neighbors under fire, which would qualify it twice for processing in the next iteration) or removing a field could take significant time. Those issues could be diminished by the use of associative containers with guaranteed constant complexity of operations. Nonetheless, the issue of memory usage still remains. An additional problem is that handling each field individually does not facilitate efficient parallelization of work. The time of processing of a single field is so short that the cost of dispatching and handling such a task would surpass the performance gain from simultaneous execution.

    The solution to the problem is to divide the whole map into smaller blocks which contain a set of fields each. Fig. 9.3 shows an example of dividing a map of size 16×16 fields into blocks of size 4×4 fields.

    Fig. 9.3. Division of a map into blocks

    Blocks are then used to determine which parts of the map require processing in the next iteration of the simulation. Black fields represent an area which is under fire after a given iteration (the actual amount of fire is not important – only the fact that there is some amount of fire that can spread is relevant in this example). Blocks with a gray background are the ones which will be processed during the next iteration. The top-most block with a gray background does not contain fields under fire, but it neighbors fields in that condition from another block. This means that fire may spread between the blocks in the next iteration. With the described optimization, the time of a single simulation was decreased from several minutes to under a minute.
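    The marking rule can be sketched as follows; for clarity the sketch scans the whole map, whereas the real module would only scan the blocks active in the current iteration.

        // A block is processed in the next iteration if a burning field lies in it
        // or if such a field's Moore neighborhood reaches into it.
        static boolean[][] markActiveBlocks(Field[][] map, int blockSize) {
            int nb = map.length / blockSize;   // blocks per side (square map assumed)
            boolean[][] active = new boolean[nb][nb];
            for (int x = 0; x < map.length; x++) {
                for (int y = 0; y < map[0].length; y++) {
                    if (map[x][y].fireIntensity <= 0) continue;
                    // Activate the owning block and every block touched by this
                    // field's Moore neighborhood, so fire can cross block borders.
                    for (int dx = -1; dx <= 1; dx++) {
                        for (int dy = -1; dy <= 1; dy++) {
                            int bx = Math.min(Math.max((x + dx) / blockSize, 0), nb - 1);
                            int by = Math.min(Math.max((y + dy) / blockSize, 0), nb - 1);
                            active[bx][by] = true;
                        }
                    }
                }
            }
            return active;
        }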

    A further optimization is based on the fact that the processing of each field is independent from the processing of all other fields. The new state of a field is calculated based on the state of its neighbors from the previous iteration. This means that the processing of fields within the same iteration can be performed in parallel without any locks or critical sections. To reduce the time of a single iteration, the implemented Java applet creates a pool of worker threads and dispatches the processing of blocks to them. Each block is processed sequentially by a worker thread. As was already stated, processing individual fields in parallel would not provide any performance gain due to the short time of the calculations compared to the cost of dispatching tasks to the worker thread. By using blocks as the unit of work for worker threads, the processing time of dispatched jobs is increased and overcomes the cost of dispatching and handling, as shown in the sketch below.
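    A minimal sketch of this dispatch, assuming a standard fixed thread pool; Block stands in for the applet's real block type.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.concurrent.Callable;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        interface Block { void processSequentially(); }

        class IterationRunner {
            private final ExecutorService pool;

            IterationRunner(int workerThreads) {   // the tunable internal parameter
                pool = Executors.newFixedThreadPool(workerThreads);
            }

            void runIteration(List<Block> activeBlocks) throws InterruptedException {
                List<Callable<Void>> jobs = new ArrayList<>();
                for (Block b : activeBlocks) {
                    jobs.add(() -> { b.processSequentially(); return null; }); // one block = one job
                }
                // Fields read only the previous-iteration snapshot, so no locks are
                // needed; invokeAll returns when the whole iteration is finished.
                pool.invokeAll(jobs);
            }
        }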

    Multi-threaded execution decreases the time of the simulation (down to below half a minute on a dual-core processor with Hyper-Threading) but at the same time increases the load generated on the volunteer's machine. The number of worker threads created by the Java applet is one of its internal parameters and can be fine-tuned depending on the situation. In normal conditions, only one worker thread may be used so as not to overload the volunteer's machine. In an emergency situation, the size of the worker thread pool can be increased to obtain the needed results faster.

    1.4. Improving the results – genetic approach

    The results of a single run of simulations already bring some knowledge about the optimal placement of fire stations. They are, however, only as good as the input data was. To improve the received results, it is possible to increase the number of tested sets of fire stations. The problem with such an approach is that the overall time of the simulations can increase significantly without any guarantee that the final results are better than with a smaller set of input data. It is, after all, just a random search in the practically infinite space of all possible solutions. The best approach would be to use the results of one run of simulations and build upon them to produce another set of input data that could score better results.

    One solution that comes up almost instantly is the use of genetic algorithms. They perfectly match the approach described above. What was previously called the input data set becomes a population in the terminology of genetic algorithms. A single set of fire stations becomes an individual in the population. By using genetic operations like crossover and mutation, one population can be transformed into another – with individuals expected to score better results – which represents the next epoch. Fig. 9.4 shows an overview of the genetic algorithm applied to the problem of finding near-optimal locations for fire stations.

    Fig. 9.4. Overview of genetic algorithm

    A genetic algorithm typically consists of the following steps [4]:

    1. Initial population generation.
    2. Evaluation of individuals.
    3. Selection.
    4. Crossover and mutation.
    5. Succession.

    Steps 2-5 are repeated in subsequent epochs.

    In the case of positioning fire stations, the most demanding step (in terms of processing power) is the evaluation of individuals. As was previously described, to obtain an accurate rating for a single set of fire stations, a handful of simulations needs to be performed. By utilizing the power of the grid (as shown in Fig. 9.4), the time required for a single epoch is greatly decreased.
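    The overall loop can be sketched as below, with the grid round trip abstracted behind an Evaluator interface; all names are illustrative, and an individual is encoded here simply as an array of station coordinates flattened into doubles.

        import java.util.List;

        interface Evaluator {
            double[] evaluate(List<double[]> population);  // burned area per individual
        }

        class GeneticLoop {
            void run(List<double[]> population, Evaluator grid, int epochCount) {
                for (int epoch = 0; epoch < epochCount; epoch++) {
                    double[] burnedArea = grid.evaluate(population);   // step 2, on the grid
                    double[] fitness = new double[burnedArea.length];
                    for (int i = 0; i < fitness.length; i++) {
                        fitness[i] = 1.0 / (1.0 + burnedArea[i]);      // smaller area -> fitter
                    }
                    population = nextGeneration(population, fitness);  // steps 3-5
                }
            }

            List<double[]> nextGeneration(List<double[]> pop, double[] fitness) {
                // selection, crossover, mutation and succession (with elitism) go here
                return pop;
            }
        }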

    1.4.1. Selection

    Simulations performed on the grid result in a ranking of individuals ordered according to how well they handled the fires. Results from that ranking can be used to calculate the values of the fitness function. The fitness values are then used in the selection stage of the genetic algorithm. The basic approach is to use roulette-wheel selection, in which the probability of an individual being chosen for future procreation is proportional to the score it received during the evaluation stage. Fig. 9.5 shows an example of a roulette wheel built for five individuals.

    Fig. 9.5. Example of a roulette-wheel for five individuals [3]

    The area occupied by each individual is proportional to its fitness value. Individual E got the highest rating during the evaluation stage and has the biggest chance to be selected for future procreation. Individual A got the lowest rating, so it is least likely to be picked. It is not impossible, though. The presence of such lower-rated individuals can help to get out of a local maximum of the fitness function (into which the highest-rated individuals may fall instead of reaching the global maximum).
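    A minimal sketch of one spin of such a wheel (assuming all fitness values are positive):

        import java.util.Random;

        // The probability of returning index i is proportional to fitness[i].
        static int spinWheel(double[] fitness, Random rnd) {
            double total = 0;
            for (double f : fitness) total += f;
            double r = rnd.nextDouble() * total;   // a random point on the wheel
            for (int i = 0; i < fitness.length; i++) {
                r -= fitness[i];
                if (r <= 0) return i;              // the wheel stopped on individual i
            }
            return fitness.length - 1;             // guard against rounding error
        }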

    There is a risk that, when creating a new population through the randomized operations of selection, crossover and mutation, the best individual from the previous population is lost. To prevent the disappearance of the best chromosome from the population, a method called elitism is used. It copies the best individuals from the previous population to the new one, bypassing the procreation stage [3]. In the worst case, the fitness value of the best individual in the new population is not improved compared to the previous population, but it also does not decline. This improves the performance of the genetic algorithm by reducing the number of epochs required to obtain the desired results [3].

    1.4.2. Crossover and mutation

    For the purpose of the genetic algorithm, two basic genetic operations need to be defined for the considered individuals: crossover and mutation. The purpose of the crossover is to produce offspring by choosing genes from the parent chromosomes [4]. In the case of the discussed individuals, a single gene corresponds to a single fire station location. The crossover operation exchanges a pair of fire stations between two individuals.

    Fig. 9.6 shows an example of crossover. Positions of fire stations are marked with dots. In Fig. 9.6a, the parent individuals are presented. The stations (or genes) chosen for exchange are additionally marked with rings. Fig. 9.6b shows the two offspring produced by the crossover. Empty rings represent the positions of the original genes before the crossover. The dots marked with rings are the exchanged stations.

    (a) parents chosen for crossover

    (b) two offspring

    Fig. 9.6. Example of crossover of two individuals

    Mutation of a chromosome is a random change of some of its genes [3]. As genes correspond to fire stations, mutation is as simple as a random displacement of one or more stations. Fig. 9.7 shows an example of such a mutation. In Fig. 9.7a, an individual before mutation is presented. The fire station (the gene) which is about to undergo mutation is marked by a ring. Fig. 9.7b shows the individual after the mutation. The empty ring represents the original position of the station (the original gene) and the arrow shows the displacement of the station during the mutation.

    a) individual before mutation
    b) individual after mutation

    Fig. 9.7. Example of mutation of an individual
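    Both operators are straightforward when an individual is encoded as an array of station coordinates; the sketch below uses java.awt.Point for brevity and is an illustration, not the chapter's actual code.

        import java.awt.Point;
        import java.util.Random;

        // Crossover: exchange one randomly chosen fire station between two parents.
        static void crossover(Point[] a, Point[] b, Random rnd) {
            int gene = rnd.nextInt(a.length);      // station position to exchange
            Point tmp = a[gene];
            a[gene] = b[gene];
            b[gene] = tmp;
        }

        // Mutation: randomly displace one fire station (map-bounds clamping omitted).
        static void mutate(Point[] individual, Random rnd, int maxShift) {
            int gene = rnd.nextInt(individual.length);
            individual[gene].translate(rnd.nextInt(2 * maxShift + 1) - maxShift,
                                       rnd.nextInt(2 * maxShift + 1) - maxShift);
        }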

    1.4.3. Bringing the pieces together

    By combining the selection, crossover and mutation techniques described above with the already-implemented Comcute computational module for the evaluation of individuals, a complete genetic algorithm is obtained. It was implemented as a Java desktop application (shown in Fig. 9.8).

    Fig. 9.8. Main window of Java application implementing genetic algorithm

    The graphical user interface of the application provides fields for setting up the parameters of the simulation (location of the map, number of fire stations in each set, number of test cases) and the parameters of the genetic algorithm (population size, epoch count). The identifier of the Comcute computational module also needs to be provided to offload the evaluation tasks to the grid.

    The application uses the Comcute platform's customer API to register tasks and input data, monitor the state of execution on the grid and retrieve the final results. Communication between the Java application and the Comcute platform takes place through web services. For the needs of authentication, a dedicated customer account (with adequate permissions) has been created on the Comcute platform for use by the Java application.

    The Java application performs the subsequent steps of the genetic algorithm (as shown in Fig. 9.4). The evaluation stage is offloaded to the grid. The results of the evaluation are retrieved and used as input data for selection and the further steps of the algorithm. At the end of an iteration, a new population is obtained for the next epoch. The evaluation of individuals is again offloaded to the Comcute grid and the algorithm continues as above until reaching the desired number of epochs. Data from subsequent epochs is recorded for future analysis.

    1.5. Tests and results

    The presented solution was tested with the help of anonymous volunteers donating the processing power of their computers to the Comcute grid project. Fig. 9.9 shows the main project website with a fire simulation in progress. To encourage users to participate in the project, a live visualization is presented on the website during the simulation. Fig. 9.10 shows four snapshots from a simulation performed on one of the volunteers' machines.

    During the simulation, an increased system load can be observed. Fig. 9.11 shows the history of CPU, memory and network usage during the start and execution of a simulation. The top-most chart shows the CPU load history. It presents four series of data – one for each of the processor's cores. The middle chart shows the history of RAM usage. It was truncated, as all plotted values do not exceed 20%. The bottom-most chart shows the network usage.

    Point A in Fig. 9.11 represents the moment when the user activated a link on the Comcute website to participate in the computations (plots before that point represent the load of an idle system with a running web browser). Period A-B represents the activity of the Comcute loader program. Its purpose is to download the computational module and other required resources (e.g. additional libraries) and pass the control flow to the module. The high peak of network usage shows when the resources were downloaded.

    Fig. 9.9. Comcute grid website with fire simulation in progress

    Fig. 9.10. Example of ongoing simulation

    Point B marks the beginning of the computational module execution. When the simulation starts, all cores of the processor are utilized (the number of worker threads can be tuned, as stated before). Points C and D show how the system load drops between the execution of subsequent test cases. Each test case is run using multiple worker threads, but the switch to another test case is a single-threaded operation. According to the chart, the execution of a single test case takes approximately 15-20 seconds.

    Fig. 9.11. CPU, memory and network usage at the start and during the simulation

    Values from several epochs of the genetic algorithm were recorded to analyze whether the genetic operations produce individuals that score better results. The ultimate criterion is the score of the best individual in the population in subsequent epochs. A chart presenting the area burned during simulations involving the best individuals is shown in Fig. 9.12. The x-axis represents the epoch and the y-axis shows the burned area. As can be easily seen, the scores improve over subsequent epochs.

    How much can be improved during the execution of the genetic algorithm is highly dependent on the randomly chosen initial population. If one of the initial individuals is already close to the optimal solution, improvements are slight at best (and there can even be no improvement at all). On the other hand, if none of the initial individuals is near the optimal solution, significant improvements are only seen after a considerable number of epochs.

    Fig. 9.12. Area burned during simulations involving the best individual of the population in subsequent epochs.

    Fig. 9.13 shows the set of fire stations which corresponds to the best individual from epoch 15 of the genetic algorithm execution presented by the chart in Fig. 9.12.

    Fig. 9.13. Location of fire stations corresponding to the best individual from epoch 15 (see Fig. 9.12).

    1.6. Conclusions

    The performance of a single simulation performed on a volunteer's computer could be further improved by the use of NVIDIA® CUDA technology. It would make it possible to offload the multi-threaded computations to the graphics card, which is best suited for them. The performance gain would allow executing more test cases (which increases the accuracy of the evaluation), performing simulations on higher-resolution maps (which increases the accuracy of the simulation), or possibly both. The described implementation uses a Java applet as the technology for the Comcute computational module, and libraries like JCUDA [6] (a set of Java bindings for the NVIDIA CUDA libraries) would provide the means for utilizing the power of volunteers' graphics cards.

    References

    1. Almeida R. M., Macau E. E. N.: Percolation model for wildland fire spread dynamics, Proceedings of the International Conference on Chaos and Nonlinear Dynamics, São José dos Campos, Brazil, 2010.
    2. Foster I., Kesselman C., Tuecke S.: The anatomy of the grid: Enabling scalable virtual organizations, Int. J. High Perform. Comput. Appl., 15(3), August 2001, pp. 200-222.
    3. Nedjah N., Abraham A., de Macedo Mourelle L.: Genetic Systems Programming: Theory and Experiences, Springer Verlag, New York 2009.
    4. Pyne S. J., Andrews P. L., Laven R. D.: Introduction to Wildland Fire, John Wiley and Sons, New York 1996.
    5. Russell S. J., Norvig P.: Artificial Intelligence: A Modern Approach, 2nd edition, Prentice Hall, Upper Saddle River 2003.
    6. JCUDA Project, http://www.jcuda.de/, May 2012.

  • Foundations of Grid Processing Architecture for the Comcute System

    In the chapter, fundamental system algorithms and structures implemented in the Comcute system are described and analysed in detail. Layered architecture of the system model is highlighted. System tasks of the layers are elaborated, presented and described. Operational details of communication interfaces among layers are worked out and examined. The focus is put onto implemented system components with regard to their operability and efficiency. Scalability and system openness, as the key design factors of the implementation, are deliberatively taken into account. Important aspects of operability are addressed and the issues of validation, verification and deployment of the adopted system solutions are discussed. Practical application aspects of the Comcute system are described with respect to its final implementation and target installation in the form of grid processing in the open worldwide Internet.

    1.1. Design issues

    Several design versions of the Comcute system have been elaborated and successfully implemented. At the present stage, the system has been tested, verified and validated in all its fundamental functionalities. Scalability tests have been carried out within the required application scope. In practical experiments, Comcute has processed both system layer tasks and application computations according to the design requirements. As applications, several instances have been run: breaking of DES codes of specified lengths, generation of large prime numbers with defined properties, and research on text file similarities in compression processes. The obtained results of the experiments have allowed the elaboration of general conclusions and experiences for the deployment of the Comcute system as grid computing in the Internet. Two processing paradigms have been considered: volunteer computing as well as obligatory computing [1], [2].

    1.2. System architecture

    The layered architecture has been elaborated for the Comcute system [3]. Layer W is responsible for task and data distribution, as well as for the delivery of execution modules and data packages for processing. Layer S organises the processing, while in layer I the processing is actually carried out. Passing of results starts at layer I and goes up to layer W through layer S. This is presented in the conceptual scheme in Fig. 3.1. Moreover, above layer W, a high-level layer Z has been introduced. Layer Z provides and serves as an entry interface to the Comcute system for clients who define and launch their applications.

    Fig. 3.1. General conceptual scheme for the Comcute system

    The conceptual scheme (Fig. 3.1) illustrates the operational implementation of layer S in cooperation with layers W and I in the process of execution of the required computational sub-tasks. The sub-tasks are components of high-level applications delivered from layer Z. In practice, layer S receives complete program modules with attached data packages. Next, both the program modules and the data packages are sent down to layer I [4].

    1.2.1. Inter-layer cooperation

    The conceptual scheme (Fig. 3.1) has been developed and elaborated with respect to realisation details. In Fig. 3.2, the system architecture and the complete operational flowchart are presented. Practical processing aspects have been deliberatively spotlighted. The architecture concentrates around four main conceptual components: server W, servers S, the Load Balancer and the global user area (Internet).

    Fig. 3.2. Design of operational diagram of the Comcute system

    Server W monitors operational activities and controls task execution at the higher layer. It distributes task modules to the relevant servers S, and then collects and combines the obtained results. Server W is the entry point for an application client. Servers S disseminate task modules to Internet users, control their execution and distribute data packages. Servers S verify the correctness of the results obtained from Internet users. The next component, the Load Balancer, connects Internet users to servers S with a small temporary processing workload. When an Internet user comes to the Comcute system, a server S gets its own copy of a task module with the relevant data packages.

    According to the chart presented in Fig. 3.2, one may enumerate consecutive operational steps:

    1. servers S are registered at server W
    2. server W distributes task modules and data packages
    3. server W determines task allocations according to Load Balancer data
    4. Internet user opens a web page
    5. Internet user selects the link of Load Balancer
    6. Load Balancer attaches the Internet user to a selected server S
    7. web browser of the Internet user gets its task
    8. the task connects to the Load Balancer in order to get connection to server S
    9. the task gets its data package for processing
    10. results are sent back from the Internet user to server S and to server W.

    The proposed solution is sensible and clear, and has allowed the achievement of the specified operability and effectiveness. One may notice that the operational diagram incorporates the key features of scalability and system transparency.
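    From the volunteer's perspective, steps 4-10 amount to the simple loop sketched below; all type and method names are hypothetical stand-ins, not the actual Comcute interfaces.

        interface LoadBalancer { String pickServerS(); }           // steps 5-6

        interface TaskModule {
            byte[] requestDataPackage(String serverS);             // steps 8-9
            byte[] process(byte[] dataPackage);
            void sendResult(String serverS, byte[] result);        // step 10
        }

        class VolunteerLoop {
            void run(LoadBalancer lb, TaskModule task) {           // task obtained in step 7
                String serverS = lb.pickServerS();
                byte[] pkg;
                while ((pkg = task.requestDataPackage(serverS)) != null) {
                    byte[] result = task.process(pkg);             // the actual computation
                    task.sendResult(serverS, result);
                }
            }
        }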

    1.2.2. Applications

    The Comcute system has been successfully implemented, tested and validated. In its current version, one may compute one's own processing tasks. As applications, several instances have been run: breaking of DES codes of specified lengths, generation of large prime numbers with defined properties, and determination of text file similarities in compression processes.

    1.2.3. Data packages distribution for tasks

    Task distribution in the Comcute system is not limited by the system itself; it is merely defined by the types (classes) of computation applications and their own inherent characteristics. A Comcute user should only deliver task code modules with the relevant data packages and may control the distribution processes. The Comcute system does not confine the range of processing models, and allows running almost any of the currently used distributed network processing paradigms, e.g. cloud or grid computing.

    Fig. 3.3. Flowchart of task distribution model for the Comcute system

    In the current version of the Comcute system, two models of data distribution have been carefully tested and verified. In the applications of large prime number generation and DES code breaking, the data packages are not correlated with one another, so consecutive sub-executions can be processed separately and independently. Every data package can be processed by practically any random Internet user. Data distribution, in this case, is significantly simpler. The data is packaged into portions, and each package is assigned a unique identifier. The identifier is used in the result collection processes, after completion of the tasks' execution. In the case of text file content similarity determination, partial results have to be used in consecutive processing, and they should be stored temporarily on servers S. In general, in many cases the data is inherently connected, which results in the necessity of internal communication. The issue can be solved by the use of the functionalities of the Distributor component in layer S (Fig. 3.3). In layer S, communication among servers is not considered, and there is no possibility of data exchange within the layer. Internet users should get the relevant data packages, so correlated data should be placed on the same server S. In server W, the Data Generator distributes the relevant data packages to selected servers S, solving the issues according to the adopted optimisation procedures [5], [6].

    1.3. Implementation specifications

    In the implementation of the Comcute system, several classes of functionalities have been realised. Among others, one may enumerate the operational functionalities and software components like: user interface, inter-node communicator, layer S interface, workload monitor, task status monitor, task execution controller, repository for tasks, task distributor, results collector, repository of results, and many others. They are characterised below:

    User interface

    • reception of high-level applications for execution,
    • controlling of tasks execution status,
    • sending results to a user.

    Inter-node communicator

    • general communication with servers of layer W,
    • communication with subsets of servers in layer W,
    • sending messages to selected server sub-groups,
    • reception of messages from selected servers of layer W.

    Layer S interface

    • reception of messages from selected servers of layer S,
    • data sending to selected servers S,
    • reception of results from servers of layer S.

    Workload monitor

    • determination of the inner workload of a computing node,
    • global determination of workload of servers in layers W and S,
    • generation of ordered list of workloads of servers in a layer.

    Task status monitor

    • status control of tasks being in execution,
    • reporting of task execution statuses.

    Task execution controller

    • task partitioning into execution modules,
    • distribution of modules with task codes,
    • reception and analysis of partial results,
    • verification of results of task execution,
    • verification of task termination conditions,
    • consolidation of partial results into complete ones.

    Repository for tasks

    • storing and delivering of execution codes for tasks,
    • delivering and storing of data packages for tasks,
    • collecting and storing partial results of tasks.

    Task distributor

    • deploying of tasks on servers in layer S,
    • distribution of data packages for processing,
    • reception of results from execution servers in layer S.

    Results collector

    • starting of execution code for results consolidation,
    • storing of results in results repository.

    Repository of results

    • storing of complete final results,
    • results presentation for Comcute users,
    • verification of execution of the processed tasks.

    In general, the specified functionalities and software components have been successfully tested and verified. The construction of the functionalities satisfies the system transparency and interoperability requirements and the software can be adapted to paradigms of grid or cloud computing in the Internet.

    1.4. Conclusions

    Architectural and conceptual foundations of the Comcute system have been elaborated and described. The Comcute system has been implemented, verified and validated. Several practical applications have been processed by the system. The results obtained for the applications – DES code breaking, large prime number generation and text file similarity determination by compression – have confirmed and proven the applicability of the global concept for practical realisation. At present, the system can be deployed in the Internet and can benefit from the practically unlimited processing power of personal computers in the global network for many various types of applications [7]. In general, one can notice that the structure and conceptual solutions of the Comcute system favour massive processing of a great number of small independent tasks, executed according to the single instruction multiple data (SIMD) processing paradigm. However, other types (classes) of distributed network processing, like grid or cloud computing, can also be supported and realised in practice within an acceptable level of efficiency and computing optimality.

    References

    1. Brudło P.: Berkeley Open Infrastructure for Network Computing, Chapter in monograph: „Distributed Calculations in Computer Systems of Grid Class Architecture”, Publisher: Gdansk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Gdańsk, Poland, 2012, pp. 39-45.
    2. Balicki J., Brudło P., Czarnul P., Kuchta J., Matuszek M., Szpryngier P., Szymański J.: Functional Design of the Comcute System, R&D technical report 33/2011, Gdansk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Gdańsk, Poland, 2011.
    3. Brudło P.: Implementation Issues in the Comcute System, Chapter in monograph: „Distributed Calculations in Computer Systems of Grid Class Architecture”, Publisher: Gdansk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Gdańsk, Poland, 2012, pp. 127-136.
    4. Brudło P., Kuchta J., Szpryngier P., Szymański J.: Requirements for Implementation of Selected Computational Methods for the Comcute System, R&D technical report 36/2011, Gdansk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Gdańsk, Poland, 2011.
    5. Brudło P., Czarnul P., Kuchta J., Szpryngier P., Szymański J.: Characteristics of Distributed Computations for the Comcute System, R&D technical report 41/2011, Gdansk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Gdańsk, Poland, 2011.
    6. Balicki J., Bieliński T., Brudło P., Paluszak J., Szymański J.: Dissemination of Computations with Distribution Servers, Chapter in monograph: „Distributed Calculations in Computer Systems of Grid Class Architecture”, Publisher: Gdansk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Gdańsk, Poland, 2012, pp. 113-126.
    7. Brudło P.: Security in Monitoring, Chapter in monograph: „Distributed Calculations in Computer Systems of Grid Class Architecture”, Publisher: Gdansk University of Technology, Faculty of Electronics, Telecommunications and Informatics, Gdańsk, Poland, 2012, pp. 175-184.

  • Security Mechanisms in the Comcute System

    The aim of this paper is to point out the basic security problems and mechanisms in the Comcute system – a system for the maintenance of large computing power in the face of a critical crisis. Moreover, security mechanisms and tools useful both in the laboratory model and in the target version of the Comcute system are presented.

    1.1. Assumptions and security requirements

    The laboratory Comcute system is used to investigate the distributed system properties and the ability to disperse compute-intensive calculations over computers in a local area network (laboratory) or over computers connected to the Internet. The laboratory system model should evolve with time and the intensity of research (according to the intentions of the designers), probably into a prototype of the target system. For this reason, the security mechanisms related to the prototype remain valid and important for the ready-to-operate system as well.

    However, we need to express some additional remarks. While the laboratory system usually works in a closed environment (university laboratories) and simulates the behavior of the target system, the fully fledged system will be available and accessed on many layers from the Internet. Moreover, the Comcute system will temporarily store some client data and information about the results of computations. For this reason, designing, planning and deploying a security policy is necessary. The security policy document should be approved by the directorate of the organization (the owner of the target Comcute system) and be available to all clients as a kind of guarantee of calculation quality and of good practices with client data and other client resources. Such a security policy document should contain the following items [1,4,6,9,10]:

    1. General provisions, basic legal constraints, glossary of items;
    2. Authentication and identification rules;
    3. List of available services and terms of usability;
    4. Description of resources available during computations;
    5. Risk analysis against security costs and the expected benefits;
    6. Operation requirements for security (selected communication activities, protocols, provision rules of the services, etc.);
    7. Security remedies and tools to achieve the security targets;
    8. Security and risk management (audit, control frequency, use of external auditor companies), staff and client policies.

    Let us assume the following types of Comcute users [2]:

    • Configuration (architecture) administrators;
    • Security administrators – privileges and roles (authorization) managers, etc.;
    • System operators and managers (task configuration, computation flow management and control);
    • Clients – principals of calculations.

    Depending on the number of users, different models of access control may be considered. For a smaller number of users (up to a few hundred) one can choose a more restricted model (like Bell-LaPadula or another Mandatory Access Control model) [6] or a milder one (Discretionary Access Control with object separation) [5,6]. For a bigger number of users (>1000), the Role Based Access Control (RBAC) model is very expensive at the introduction stage but easy to manage [5].

    1.2. Information Security Management

    The general model of security management layers is shown in Tab. 12.1:

    Tab. 12.1. The layers of the general security management model

    5. Security Management (applications, databases, EDE, e-mail, etc.)
    4. Security Agents, Security Protocols (authentication, key management, etc.)
    3. Security Services (confidentiality, integrity, non-repudiation, etc.)
    2. Security Mechanisms (digital signature, authentication)
    1. Basic Modules (algorithms, modes of operation)

    In this chapter we are going to discuss some issues from layer 3 (security services) together with security mechanisms (layer 2 – commonly known and described) and, partially, security protocols (layer 4). All security management elements must be linked with the security policy (which is out of the scope of this chapter). However, the basic security modules, like algorithms, modes of operation, key generators, and random and pseudo-random number generators, are perfectly described in the literature [3,4,5,6] and there is no need to cover this area here. Below is some explanation of how the basic cryptographic protocols are used for building security mechanisms. The aim of the first one – secure message transfer with an encrypted session key [3,4,5,6] – is to ensure the confidentiality of the content of a transmitted message or file. An additional side effect is ensuring the authenticity of the message's receiver. It runs as follows:

    1. The Sender prepares message M, generates a random session key K and creates the cryptogram C_K(M) using a symmetric cryptographic algorithm of good quality, like AES, Twofish, etc.

    2. The Sender retrieves the Receiver's public key O_pu from a trusted database, or retrieves it from a public key certificate signed by a trusted Certification Authority (CA), and then creates a cryptogram of the session key encrypted with the receiver's public key. The Sender uses the same symmetric cryptographic algorithm in ECB mode as in step 1 [3]. The additional cryptogram of the session key, C_Opu(K), is concatenated with the cryptogram C_K(M), and these enciphered messages are sent to the Receiver. Please note that one message can carry encrypted messages for many receivers, each containing the session key encrypted with the given receiver's public key.

    3. Each receiver of the message deciphers the session key K from the cryptogram using the corresponding private key O_pr (forming a pair with the public key O_pu) and obtains K = D_Opr(C_Opu(K)).

    4. In the following step, the Receiver deciphers message M using the session key K: M = D_K(C_K(M)).
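    A minimal sketch of this protocol with the standard Java Cryptography Architecture, assuming AES for the message and RSA for encrypting the session key (the chapter does not prescribe these algorithm and mode choices):

        import javax.crypto.Cipher;
        import javax.crypto.KeyGenerator;
        import javax.crypto.SecretKey;
        import java.security.PublicKey;

        final class HybridEncryption {
            static byte[][] encryptForReceiver(byte[] message, PublicKey receiverPub)
                    throws Exception {
                KeyGenerator kg = KeyGenerator.getInstance("AES");
                kg.init(256);
                SecretKey sessionKey = kg.generateKey();    // step 1: random session key K

                Cipher aes = Cipher.getInstance("AES");     // C_K(M): symmetric encryption
                aes.init(Cipher.ENCRYPT_MODE, sessionKey);
                byte[] cM = aes.doFinal(message);

                Cipher rsa = Cipher.getInstance("RSA");     // step 2: C_Opu(K), session key
                rsa.init(Cipher.ENCRYPT_MODE, receiverPub); // encrypted with the public key
                byte[] cK = rsa.doFinal(sessionKey.getEncoded());

                // Both cryptograms are concatenated and sent; the receiver reverses
                // the procedure with its private key (steps 3-4).
                return new byte[][] { cK, cM };
            }
        }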

    The digital signature protocol is the second basic mechanism used to achieve authenticity, integrity and non-repudiation of communication. A digital signature (a cryptogram created with the use of the Sender's signing key) is de facto an appendix to a message. It runs as follows [4,11]:

    1. The Sender prepares message M, calculates the hash of the message H(M) using a one-way function of good quality, like SHA-1 or SHA-256, and then, using some digital signature scheme (like RSA or DSA) and the signer's private key K_S, generates the cryptogram (signature) S_Ks(H(M)).

    2. The Sender concatenates message M with the digital signature S_Ks(H(M)) and sends the whole packet to the Receiver.

    3. The Receiver retrieves the Sender's public key K_Spu from a trusted database, or retrieves it from a public key certificate signed by a trusted Certification Authority (CA), then calculates the value of the one-way hash function H(M') from the received message and uses both these values (H(M') and K_Spu) to validate the authenticity of the digital signature S_Ks(H(M)).

    4. If the signature is valid (for example, in the RSA scheme this means that the hash value of the received message H(M') is equal to the hash value H(M) deciphered with the Sender's public key K_Spu), then the message is accepted as original and credible. Otherwise, the message is rejected as originating from an unknown source or as having been corrupted in transit.
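    A minimal sketch of the signature protocol with the standard Java Cryptography Architecture; the Signature class combines the hash computation (here SHA-256) with the RSA signing operation:

        import java.security.PrivateKey;
        import java.security.PublicKey;
        import java.security.Signature;

        final class SignatureExample {
            static byte[] sign(byte[] message, PrivateKey senderPriv) throws Exception {
                Signature s = Signature.getInstance("SHA256withRSA"); // H(M), then S_Ks(H(M))
                s.initSign(senderPriv);
                s.update(message);
                return s.sign();                                      // appended to M (step 2)
            }

            static boolean verify(byte[] message, byte[] sig, PublicKey senderPub)
                    throws Exception {
                Signature s = Signature.getInstance("SHA256withRSA"); // recomputes H(M') (step 3)
                s.initVerify(senderPub);
                s.update(message);
                return s.verify(sig);                                 // step 4: accept or reject
            }
        }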

    Public Key Infrastructure (PKI) [4,11] is composed as a network of CAs (Certification Authorities) and RAs (Registration Authorities) – servers used for user verification and registration – together with a security policy defining all public key management procedures. A public key certificate is a data structure signed by the issuer of the certificate. This signed structure contains the identity data of the user applying for the certificate (and hence the identity of the owner of the private key forming a pair with the public key placed in the certificate), the public key (the mandatory part of the certificate), the certificate identification number, the issuer description and the signature. A certificate can also comprise much additional data: issue date, time and period of validity, certificate (key) destination (application), PKI standard version, application constraints, delegation of privileges, etc. It is noteworthy that a public key certificate strictly binds a user to his or her private key (which, of course, must be hidden and kept safely). With this property, using the public key contained in a certificate, we can verify the authenticity of all the user's operations performed with his or her private key (forming a pair with the public key published in the certificate) [11].

    1.3. Interface Security: Client – Comcute System

    In this section we look at the security issues of the Comcute client interface, i.e. the communication protocols between the W server layer and the outer world [2]. According to the section of the Comcute system design and implementation documents entitled “System Architecture – requirements from the client point of view”, this interface should allow for the implementation of the following features:

    • Problem definition together with the necessary data and parameters,
    • Source program definition and adaptation for distributed computing in the Comcute system,
    • Reading of results and their status (validity code),
    • Obtaining additional auxiliary information and data.

    There are typical threats in this area of system activity [3,5,6]:

    • Eavesdropping, wiretapping – breach of confidentiality (passive attack),
    • Destroying or modification of communication, spoofing, etc. (active attacks),
    • Traffic analysis.

    We can distinguish at least three alternate methods of client communication with the Comcute system:

    1. A classical thin interface (web browser) – a form to fill in, taken from the Comcute webpage,
    2. A web service, published and available for authorized clients only,
    3. Other communication – any secure channel (direct personal communication, classical paper postage, delivery service, etc.).

    Depending on the method of communication, we can point out the following mechanisms necessary to achieve a minimum security level:

    1. For the www (browser) interface – an SSL/TLS cryptographic tunnel [5] and strong authentication methods: apart from login and password, additional use of, for example, a ZK (Zero Knowledge) identification protocol using smartcards [3]. Furthermore, it is worth considering an additional limitation of communication traffic to trusted nodes only, using properly configured firewalls that accept only chosen IP addresses. The SSL protocol and firewall are remedies against intruders and protect against eavesdropping and modification of communication, but they do not secure against traffic analysis.

    2. For web services – publication of service and interface details in UDDI registries (OASIS) is subject to the restrictions defined by the standard [7]. The services themselves must be secured in the same way as in point 1 (SSL/TLS and strong authentication). Again, these remedies do not protect against traffic analysis.

    3. The other ways of communication (direct, personal) can protect against traffic analysis. In this case the calculation service is configured by an operator. Such a member of the Comcute team can deploy many computations and select a different start-up delay for each of them, so nobody can establish for which client a given computation is deployed. Protection against eavesdropping and modification of communication is obtained with the same remedies as above (SSL, operator authentication).
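
    The sketch below opens an HTTPS connection from a client, as mentioned in point 1; the URL is a placeholder. Certificate and hostname verification are performed by the default Java trust machinery, so a connection to an untrusted or spoofed server fails during the handshake.

        import java.io.InputStream;
        import java.net.URL;
        import javax.net.ssl.HttpsURLConnection;

        public class TlsClient {
            public static void main(String[] args) throws Exception {
                // Placeholder address of the Comcute web interface.
                URL url = new URL("https://comcute.example.org/task-form");
                HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
                conn.connect(); // TLS handshake: the server certificate is validated here
                System.out.println("Cipher suite: " + conn.getCipherSuite());
                try (InputStream in = conn.getInputStream()) {
                    System.out.println("Received " + in.readAllBytes().length + " bytes");
                }
            }
        }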

    1.4. W Server Layer Security

    In this section the security issues of the W server layer are described. The W servers are internal system nodes used for managing calculation tasks issued by clients or the system operator. According to the section of the Comcute design and implementation documents entitled “System Architecture – problems and solutions”, this layer deals with the following features:

    • Receiving and verifying tasks (program code, data, additional computation requirements),
    • Data partitioning,
    • Defining task execution (completion) criteria,
    • Optionally, setting up an additional arbiter (judge) to evaluate the progress of calculations, assess computation results, resolve conflicts, etc.,
    • Converting (if necessary) the program code provided by the client into executable code accepted by the computers of Internet users,
    • Storing result data and making it available to the client.

    As this list of functionalities shows, the W servers store important data (inputs and results of calculations) belonging to external clients. Let us assume that there is more than one group of W servers, located in different local area networks connected via the Internet, and that each group of W servers cooperates with a chosen group of S servers. Many threats may materialize: attacks against the integrity of stored data (unauthorized modification, destruction) and wiretapping (reading data and computation results, discovering client identities). For these reasons data protection is very important and the system designer must consider it carefully on many levels:

    • The local area network containing a group of W servers must be protected with a properly configured firewall that accepts external communication only with chosen, strictly defined IP addresses.

    • All communication between W servers should be performed within an SSL/TLS cryptographic tunnel with mutual (two-sided) authentication; a sketch of such an endpoint is given after this list.

    • A transaction mechanism should be implemented for the distributed database used by the W servers, or full replication (a database copy at each node of the W layer), possibly synchronous, should be considered together with distributed transaction management [8].

    • An access control mechanism should limit access to the distributed database to local applications and W server layer users only (possibly also to laboratory users of the S layer).

    • Access to the W server layer should be limited to authenticated and authorized users only.
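
    The following sketch shows a mutually authenticated W-server endpoint built with the standard JSSE API only. The keystore and truststore file names, the password and the port are assumptions made for illustration.

        import java.io.FileInputStream;
        import java.security.KeyStore;
        import javax.net.ssl.KeyManagerFactory;
        import javax.net.ssl.SSLContext;
        import javax.net.ssl.SSLServerSocket;
        import javax.net.ssl.TrustManagerFactory;

        public class WServerEndpoint {
            public static void main(String[] args) throws Exception {
                char[] pass = "changeit".toCharArray(); // placeholder password

                // Keystore with this W server's own certificate and private key.
                KeyStore ks = KeyStore.getInstance("PKCS12");
                ks.load(new FileInputStream("w-server.p12"), pass);
                KeyManagerFactory kmf =
                        KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
                kmf.init(ks, pass);

                // Truststore with the internal root CA certificate (see Section 1.7).
                KeyStore ts = KeyStore.getInstance("PKCS12");
                ts.load(new FileInputStream("comcute-root.p12"), pass);
                TrustManagerFactory tmf =
                        TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
                tmf.init(ts);

                SSLContext ctx = SSLContext.getInstance("TLS");
                ctx.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);

                SSLServerSocket server = (SSLServerSocket)
                        ctx.getServerSocketFactory().createServerSocket(8443);
                server.setNeedClientAuth(true); // both sides must present a valid certificate
                System.out.println("W server listening with mutual TLS on port 8443");
            }
        }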

    1.5. Security Issues for Communication between the W Server Layer and S Servers

    In this section the security issues of the interoperation between the W server layer and the S servers (distribution of calculation tasks and collection of partial results) are discussed. According to the section of the Comcute design and implementation documents entitled “System Architecture – problems and solutions”, the communication link between layers W and S provides the following basic functionalities:

    • Transmission of a computational task (program code and a data subpacket) from a W server to S servers,
    • Collection of partial results from the S servers by the W layer.

    The main possible attacks on this communication link are: destruction or replacement of the information parts of packets (program code and data needed for calculation), spoofing, eavesdropping and data swapping (Man-in-the-Middle – MiM attacks). Considering these threats, the protection of the W–S communication link should consist of:

    • Providing the integrity and authenticity of transmitted data using digital signatures [3]; the signing side is sketched after this list,
    • Securing communication via an SSL/TLS cryptographic tunnel, or using secure message transfer with an encrypted session key, providing confidentiality and authentication of the message receiver [3],
    • Using a public key infrastructure for the proper use and management of PK certificates, which protects against spoofing and MiM attacks [11].
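
    The signing counterpart of the verification sketched earlier may look as follows; the task packet bytes and the key object are placeholders.

        import java.security.PrivateKey;
        import java.security.Signature;

        public class TaskSigner {
            // Signs a task packet (program code plus data subpacket) before a
            // W server hands it over to the S layer. The receiver verifies it
            // with the public key taken from the W server's certificate.
            public static byte[] sign(byte[] taskPacket, PrivateKey wServerKey) throws Exception {
                Signature signer = Signature.getInstance("SHA256withRSA");
                signer.initSign(wServerKey);
                signer.update(taskPacket);
                return signer.sign();
            }
        }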

    1.6. The S Server Layer Security and Communication Link Protection between S Layer and Internet Users

    In this section the security issues of the S server layer (service hosts of www providers) are described, together with the protection of the communication link between the S layer and the computers of Internet users. An S server usually controls and manages a computation subtask received from one of the W servers. According to the section of the Comcute design and implementation documents entitled “System Architecture – problems and solutions”, the communication link between layers S and I (Internet users) provides the following basic functionalities:

    • Transmission of a computation subtask (or a pointer to calculation task data located on a different server, for example an advertisement service) from an S server to an Internet user computer I,

    • Reception of partial computation results from an Internet user computer I, directly or indirectly via an ad server – in the latter case the W server is only notified that the calculation subtask has finished and that the results are accessible on another server,

    • Transmission of partial calculation results from an S server to a W server, or transfer of a message from the S server to the W server with information about where the partial result data are located.

    We have to note that the communication between an Internet user computer I and an S server is beyond the control of the W servers (and of the Comcute system as a whole). The Comcute system cannot enforce mandatory encryption of communication (task program code and data packets) without an additional agreement (contract) signed by both sides: the Comcute system owner and the S server owner (manager) and/or the advertisement server owner. When there is no cryptographic protection of the communication link between an S server and an Internet user computer I, we have to expect typical attacks such as spoofing, tampering, eavesdropping and traffic analysis. A digital signature protocol can still provide some protection against integrity attacks on the transmitted data.

    One more issue needs to be discussed – which key (located in which public key certificate) should be used to verify the validity of the communication. To avoid the Internet user having to explicitly accept a public key certificate every time a digital signature is checked, the certificate should be signed and published by a CA (Certification Authority) automatically recognized by every typical browser. This is a necessary condition if we do not want to burden or needlessly engage the Internet user.

    1.7. Necessary PKI Infrastructure

    Taking into consideration all the previously discussed security issues, two types of PK certificates are needed:

    • A public key certificate issued and signed by a CA commonly and automatically accepted (known) by browsers. The private key associated with the public key contained in such a certificate can be used for signing the program code and data of calculation tasks sent from W servers (through S servers) to Internet user computers, and possibly for signing all communication from the W servers to the S servers and, if needed, to advertisement servers as well. Note that strong protection of this private key is necessary to avoid compromising the whole system.

    • Public key certificates used for signing and protecting communication within the Comcute system, managed by an internal PK infrastructure. The master (root) key can be signed by a qualified public key certification authority, but this is not necessary; typically a self-signed PK root certificate is used internally.

    1.8. Concluding Remarks

    In this chapter the basic mechanisms necessary for protecting the Comcute system and its elements were presented. Because of well-known Internet threats, a security policy for the Comcute system should be developed first, defining the ways of achieving its objectives and the procedures to follow in the face of risk and danger. While some elements may be omitted in the laboratory system (which is partially isolated from the external world), in the target system it is advisable and necessary to design and implement the following security elements:

    • Security policy – a document approved by the board or the Comcute system owner;
    • Public Key Infrastructure – a hierarchical structure for the whole system and a Simple Public Key Infrastructure (SPKI) for the Internet layer;
    • Mechanisms and protocols providing confidentiality and authenticity of communication.

    References

    1. NIST Documents NCSC-TG, 1986-1991.
    2. COMCUTE – design and implementation documents, Faculty of ETI, GUT, 2010-2012.
    3. A. J. Menezes, P. C. van Oorschot, S. A. Vanstone – “Handbook of Applied Cryptography”, CRC Press, 1997.
    4. B. Schneier – “Applied Cryptography”, J. Wiley & Sons, 1996.
    5. D. Gollmann – “Computer Security”, 3rd ed., J. Wiley & Sons, 2011.
    6. J. Stokłosa, T. Bilski, T. Pankowski – “Data Security in Computer Systems” (in Polish), PWN, 2001.
    7. A. Nadalin, C. Kaler – “OASIS Web Services Security: WS-Security Core Specification 1.1”, http://www.oasis-open.org/committees/download.php/16790/wss-v1.1-spec-os-SOAPMessageSecurity.pdf, 2006.
    8. M. T. Özsu, P. Valduriez – “Principles of Distributed Database Systems”, 3rd ed., Springer, 2011.
    9. S. Bosworth, M. E. Kabay (eds.) – “Computer Security Handbook”, 4th ed., J. Wiley & Sons, 2002.
    10. J. Pieprzyk, T. Hardjono, J. Seberry – “Theory of Computer Systems Security” (in Polish), Helion, 2005.
    11. P. Szpryngier – “SPKI – Public Key Infrastructure” (in Polish), in: KASKBOOK. SaaS Technologies, ed. H. Krawczyk, Gdańsk: GUT, 2004.

  • Multi Agent Grid Systems

    This chapter presents the idea of merging grid and volunteer systems with multi agent systems. It gives some basics concerning multi agent systems and the most widely followed standard. Existing systems of this kind are discussed in order to finally present the possibilities of introducing agents into the Comcute system.

    1.1. Agents and Multi Agent Systems

    Emerging high-performance networks have popularized distributed computing and introduced various computational paradigms such as grid computing and volunteer computing. One of the developing architectures for distributed systems is the multi agent system, which is based on autonomous agents.

    An agent is a computer system that is situated in some environment and is capable of autonomous action in this environment in order to meet its design objectives. The agent can perceive its environment and act upon it [21].

    A Multi Agent System (MAS) is a system which consists of a number of agents. These agents are able to interact, mainly by exchanging messages, possibly through some computer network infrastructure. In order to interact successfully, agents should be able to cooperate, coordinate and negotiate with each other [20].

    1.1.1. Agents as Service Providers

    A Service Oriented Architecture (SOA) can be regarded as a paradigm for organizing and utilizing distributed capabilities that may be under the control of different ownership domains [16]. It needs to be pointed out that SOA is not a concrete architecture, nor even a tool or a framework; it is a set of guidelines that leads to a concrete architecture.

    SOA guides the process of creating and using business services throughout their lifecycle. It also defines the conditions for an infrastructure that allows different applications to exchange data and participate in business processes irrespective of operating systems or programming languages [15].

    Services are the main elements of systems implementing the SOA concept. Dictionaries define a service as the performance of work by one party for another [16]. OASIS additionally lists related ideas:

    • the capability to perform work for another,
    • the specification of the work offered for another,
    • the offer to perform work for another.

    It was said that agents exist in some environment. This may well be an environment of services: both Web Services distributed on remote machines connected to the Internet, and business services representing company activities mapped into a computer system for the sake of simulation and automation. Agents can be treated as autonomous service providers and executors existing in such an environment. Moreover, multi agent systems, which assume communication and interaction between the agents residing in the system, are well suited to this purpose.

    When one agent is going to invoke a service of another, some kind of agreement between them is needed. Such an agreement should be reached through negotiation and be profitable for both sides. These interactions can be described by a Service Level Agreement (SLA): a set of contractual obligations between a service consumer and a service provider, which can represent guarantees of quality of service (QoS), non-functional requirements of the service consumer and promises of the service provider [10]. An SLA can contain the following components [1] (a minimal data model is sketched after the list):

    • all parties involved in negotiation and execution – besides the contracting sides, these include supporting third parties responsible for monitoring, auditing, etc.,
    • a description of the service specifying the functionality delivered under the agreement,
    • service level objectives defining the required levels of the QoS parameters,
    • penalties for cases when the service provider fails to comply with the contract.
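
    A minimal data model of such an SLA, written as a plain Java record; the field names are our own illustration of the four components, not a fragment of the WS-Agreement schema [1].

        import java.util.List;
        import java.util.Map;

        // One object per concluded agreement; the field names are illustrative only.
        public record ServiceLevelAgreement(
                String consumer,                   // contracting side requesting the service
                String provider,                   // contracting side delivering the service
                List<String> thirdParties,         // supporting parties: monitoring, auditing, ...
                String serviceDescription,         // functionality delivered under the agreement
                Map<String, Double> qosObjectives, // e.g. "maxLatencyMs" -> 200.0
                double penaltyPerViolation) {      // fee charged when the provider breaks the contract
        }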

    1.1.2. Agent FIPA Standard

    As long as agents work in an isolated environment without interacting with external systems, there is no need to consider widely accepted norms and standards. The situation changes when interaction with existing systems is needed, both agent-based ones and more classical ones such as client-server. A large part of agent systems is designed with a view to cooperation between heterogeneous agents: agents from different systems can cooperate in order to exchange information or services, or to jointly achieve some goals.

    The most significant agent standard is the one defined by the Foundation for Intelligent Physical Agents (FIPA), which is part of the Institute of Electrical and Electronics Engineers (IEEE). The main goal of FIPA [3] is developing a set of standards concerning cooperation between heterogeneous agents originating from different agent systems. The most important areas of FIPA’s interest are:

    • abstract architecture — when a number of systems using different technologies are to interoperate, the fundamental elements of these agent systems must be defined,
    • management — specifications for agent management services [4], some of which are:

    • Directory Facilitator (DF) — a yellow pages service provided to other agents; an agent can register in the catalogue, declaring what type of service it makes accessible to other agents, or query the catalogue to find what services other agents offer (a registration sketch is given after this list),
    • Agent Platform (AP) — the physical infrastructure where agents can be deployed; it consists of a machine with an operating system, agent support software with agent management components, and the agents themselves,
    • Agent Management System (AMS) — exerts supervisory control over the Agent Platform; it provides a white pages service by maintaining the agents’ AIDs (Agent Identifiers), and each agent has to register with an AMS to get a valid AID,
    • Message Transport Service (MTS) — a communication service between agents residing on different platforms;

    • communication — in order to provide understandable communication between heterogeneous agents, FIPA proposes a semantic language (SL) [7] for encoding messages, and ontologies providing a vocabulary for representing knowledge.
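
    As a concrete illustration of the DF, the sketch below registers an agent’s service with the Directory Facilitator using the JADE framework [19]; the service type and name are placeholders.

        import jade.core.Agent;
        import jade.domain.DFService;
        import jade.domain.FIPAException;
        import jade.domain.FIPAAgentManagement.DFAgentDescription;
        import jade.domain.FIPAAgentManagement.ServiceDescription;

        public class ComputeAgent extends Agent {
            @Override
            protected void setup() {
                DFAgentDescription dfd = new DFAgentDescription();
                dfd.setName(getAID()); // the AID obtained from the AMS
                ServiceDescription sd = new ServiceDescription();
                sd.setType("distributed-computing"); // placeholder service type
                sd.setName("comcute-worker");        // placeholder service name
                dfd.addServices(sd);
                try {
                    DFService.register(this, dfd); // yellow pages entry, visible to other agents
                } catch (FIPAException e) {
                    e.printStackTrace();
                }
            }
        }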

    For the purposes of communication between agents FIPA defines the Agent Communication Language (ACL) [6]. Each message exchanged between agents consists of fields identifying the sender, the receiver and the message type (performative), of which only the last one, defining the communicative act, is mandatory. The communicative act [5] is an agent’s action detailed by the message content. These actions cover making requests, querying about inner state and performing negotiations (contract net). For expressing the content of ACL messages the SL language was defined [7], which, together with a specified ontology, defines the syntax and semantics of a message. A minimal example of constructing such a message is sketched below.
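
    A minimal example of constructing and sending an ACL message in JADE [19]; the receiver name, ontology and content are placeholders.

        import jade.core.AID;
        import jade.core.Agent;
        import jade.lang.acl.ACLMessage;

        public class Requester extends Agent {
            @Override
            protected void setup() {
                ACLMessage msg = new ACLMessage(ACLMessage.REQUEST); // the mandatory performative
                msg.setSender(getAID());
                msg.addReceiver(new AID("worker", AID.ISLOCALNAME)); // placeholder receiver
                msg.setLanguage("fipa-sl");                          // content expressed in SL
                msg.setOntology("comcute-task");                     // placeholder ontology
                msg.setContent("((action (agent-identifier :name worker) (compute :job j1)))");
                send(msg);
            }
        }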

    1.2. Multi Agent System as a Grid

    Grid concepts and technologies were initially developed to enable resource sharing within scientific collaborations. These collaborations required sharing not only databases but also software, computational resources and even specialized instruments such as telescopes and microscopes. Grids can be defined as systems enabling coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations [12].

    A virtual organization is a set of individuals and/or institutions defined by some sharing rules. These rules concern sharing computers, software, data, services and other resources, with the resource providers and consumers defining what is shared, who is allowed to share, and the conditions under which sharing occurs [9].

    One of the typical and well known grid systems is the Globus Toolkit. It allows for managing resources (data and computational power), monitoring their state, communicating between nodes, providing security mechanisms and detecting failures. It offers a set of services, protocols and interfaces supporting the development of grid applications. A single application deployed in the Globus system perceives the whole infrastructure as local resources; platform-level services are responsible for delivering them to the specified node [11].

    Historically, grids have focused on interoperable infrastructure and tools for secure and reliable resource sharing within dynamic and geographically distributed virtual organizations, whereas agent systems have focused on the development of concepts, methodologies, and algorithms for autonomous problem solvers that can act flexibly in uncertain and dynamic environments in order to achieve their aims and objectives [8].

    Some research tries to connect the grid and MAS paradigms, leading to multi agent grid systems, sometimes simply called agent grids. An agent grid can be described by requirements at two levels: application and functional [14]. The application level defines requirements making the system easier to build, maintain, scale, evolve, adapt and survive. Such systems should be easily adaptable, scalable to large and small sizes, and developed (evolved) by groups that do not need to know about each other. The functional requirements define a unified, heterogeneous distributed computing environment in which computing resources are seamlessly linked. Within it, agents can play the roles of applications whose computations are distributed within the environment, of resources that can be used within this environment, and of infrastructure components of this environment. Moreover, agents can be used for load balancing, resource wrapping and service brokering.

    Because of the lack of autonomy in classic grids, they are mostly predictable units. One of the most important properties of agents is their autonomy, which could bring unpredictability into the picture. It must be stated here that autonomy does not mean an agent can do whatever it wants; rather, it should be able to act without coordination from a superior unit. Agents must be designed in such a way that they always act for the sake of the whole system, and autonomy should be used for purposes such as failure handling and load balancing.

    As an example of introducing agents into a grid, the AGrIP system can be mentioned. It introduces an agent environment in order to satisfy two requirements: (a) it integrates the resources and makes them available and useful, and (b) it provides different kinds of agent grid common services [13]. The whole system is based on MAGE, a multi agent environment for humanized systems which is compatible with the FIPA standard.

    The system introduces several agent types required for building the grid:

    • DF — the agent specified by the FIPA standard, providing the yellow pages service,
    • GISA — Grid Information Service Agent, contains static and dynamic information about computing resources and the network performance between them,
    • GRMA — Grid Resource Management Agent, provides capabilities for remote job starting and cancelling as well as status checking,
    • GSSA — Grid Security Service Agent, provides the agent grid security service,
    • DMA — Data Management Agent, responsible for access to remote data and management of its transfer,
    • Agent — a fundamental agent combining one or more service capabilities into a unified and integrated execution model.

    AGrIP provides several toolkits, built on top of the underlying agents, from which applications are constructed: Information Retrieval Toolkit, Data-Mining Toolkit, Case Base Reasoning Toolkit, Expert System Toolkit, Problem Solving Applications Toolkit, and Distributed Computing Toolkit.

    A slightly different approach can be seen in the solution integrating the JADE agent-based system with the Globus middleware. JADE (Java Agent DEvelopment Framework) is another agent system compatible with the FIPA standard [19]. The solution tries to address issues such as: (a) complicated resource brokering and management in existing grid middlewares, (b) lack of interoperability between individual middlewares, and (c) overly high expectations placed on the potential user of the grid [18]. As a remedy it proposes software agents combined with ontologies. The key functions of the system are: (a) helping the user to contribute his or her resources to the grid, and (b) helping the user to execute a job within the grid.

    As part of the solution, the following agent types were introduced:

    • LAgent — an agent representing the user; it provides an intelligent interface between the user and the brokering system,
    • CICAgent — an agent representing the Client Information Center (CIC), a central repository of information about existing agent teams,
    • LMaster — an agent representing a particular team, responsible for preparing offers and carrying out negotiations with LAgents,
    • LMirror — an agent which mirrors (duplicates) the LMaster agent in order to preserve team functionality in case of LMaster failure.

    An LAgent searches for an agent team in order to join it as a worker, or to request a job specified by the user it represents. All negotiations are based on the FIPA contract net protocol (a message-level sketch is given below). When a team capable of performing the specified job is found, the LAgent sends a binary representation of the job to the LMaster, which forwards it to one of the workers.
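
    The contract net exchange can be sketched at the message level with the FIPA performatives defined in ACL (JADE constants shown); the agent names and content strings are placeholders, and a real initiator would collect and compare all proposals before answering.

        import jade.core.AID;
        import jade.core.Agent;
        import jade.lang.acl.ACLMessage;

        public class CfpInitiator extends Agent {
            @Override
            protected void setup() {
                // 1. Call for proposals, sent to the known team masters.
                ACLMessage cfp = new ACLMessage(ACLMessage.CFP);
                cfp.addReceiver(new AID("lmaster-1", AID.ISLOCALNAME));
                cfp.addReceiver(new AID("lmaster-2", AID.ISLOCALNAME));
                cfp.setContent("(execute-job :id j1 :deadline 3600)"); // placeholder content
                send(cfp);

                // 2. Wait for a proposal and award the contract.
                ACLMessage proposal = blockingReceive();
                if (proposal != null && proposal.getPerformative() == ACLMessage.PROPOSE) {
                    ACLMessage accept = proposal.createReply();
                    accept.setPerformative(ACLMessage.ACCEPT_PROPOSAL);
                    send(accept);
                }
            }
        }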

    A worker agent is equipped with a Job Executor module that allows it to execute the requested job. Two concrete implementations were prepared: (a) the Simple Job Executor, executing a job as a normal process on the worker machine, and (b) the Globus Job Executor, passing the job to the Globus system.

    Both described systems introduce multi agent systems as part of the grid infrastructure, making use of agents as service providers. Moreover, in both systems the use of the FIPA standard makes easy integration with other systems and modules possible.

    1.3. Multi Agent Systems as a Volunteer Application

    Volunteer computing is another form of distributed computing, one which makes use of a large number of distributed peers connected to the system as volunteers. The typical workflow of such a system is: (1) a server decomposes some long task into small jobs, (2) volunteers download jobs from the server, (3) the jobs are executed, (4) the results are returned to the server, (5) the server composes the results [17]. The sketch below illustrates this loop.
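
    A toy sketch of this five-step loop in plain Java, with the "long task" reduced to summing a large array; the jobs, the transport and the volunteers are simulated with local calls.

        import java.util.ArrayList;
        import java.util.Arrays;
        import java.util.List;

        public class VolunteerLoop {
            public static void main(String[] args) {
                long[] data = new long[1_000_000]; // the "long task": sum all elements
                Arrays.fill(data, 1L);

                // (1) the server decomposes the task into small jobs
                int jobSize = 100_000;
                List<long[]> jobs = new ArrayList<>();
                for (int i = 0; i < data.length; i += jobSize) {
                    jobs.add(Arrays.copyOfRange(data, i, Math.min(i + jobSize, data.length)));
                }

                // (2)-(4) each volunteer downloads a job, executes it and returns the result
                List<Long> partialResults = new ArrayList<>();
                for (long[] job : jobs) {
                    long sum = 0;
                    for (long v : job) sum += v; // executed on the volunteer machine
                    partialResults.add(sum);     // returned to the server
                }

                // (5) the server composes the partial results
                long total = partialResults.stream().mapToLong(Long::longValue).sum();
                System.out.println("Composed result: " + total);
            }
        }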

    Most volunteer systems are centralized and their structure follows a star topology. This can be troublesome when the main server is overloaded or fails.

    The PPCV system is an example of a volunteer computing system built with distributed agents. The system consists of distributed volunteers (nodes), each of which is an agent container, that is, a place where agents can reside.

    Each node can have multiple neighbours, where neighbours are nodes with a direct communication link between them. In contrast to standard volunteer systems using a star topology, this one uses a mesh topology which can be represented by a complete or incomplete graph (depending on how many connections were made while deploying new nodes). PPCV implements all the required volunteer actions, that is: (a) job decomposition, (b) remote job execution, and (c) result composition [22].

    The system is built with the following agent types:

    • Scheduler Agent — responsible for job scheduling, managing a particular node and communicating with other nodes,
    • Job Agent — responsible for executing the requested job.

    There is no specialized agent or system module responsible for decomposing jobs into smaller ones and sending them to particular agents. Instead, if decomposition is needed, the Job Agent clones itself and negotiates with its copy about the part of the job to be done by each of them. If further decomposition is needed, the two copies can clone themselves independently, and so on. This means that jobs passed to the PPCV system must be decomposable. Result composition, in turn, is done by merging the cloned agents. In order to make use of the great number of available volunteers, cloned agents may migrate to another container, perform their computations and then come back in order to merge. The whole job scheduling inside the PPCV system is decentralized and realized by the Scheduler Agents through communication between neighbours, which, on the basis of the perceived environment (node), decide whether to accept or pass on a job. The recursive clone-split-merge pattern is sketched below.
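
    The clone-split-merge pattern of the Job Agent resembles Java’s fork/join model; the sketch below imitates it locally with RecursiveTask as an analogy (it is not PPCV code): a task splits itself in two like a cloning agent, and joining the subtasks corresponds to merging the clones.

        import java.util.concurrent.ForkJoinPool;
        import java.util.concurrent.RecursiveTask;

        public class CloneSplitJob extends RecursiveTask<Long> {
            private final long from, to; // the job: sum the integers in [from, to)

            public CloneSplitJob(long from, long to) { this.from = from; this.to = to; }

            @Override
            protected Long compute() {
                if (to - from <= 10_000) { // small enough: execute directly
                    long sum = 0;
                    for (long i = from; i < to; i++) sum += i;
                    return sum;
                }
                long mid = (from + to) / 2; // "clone" and give half of the job to the copy
                CloneSplitJob left = new CloneSplitJob(from, mid);
                CloneSplitJob right = new CloneSplitJob(mid, to);
                left.fork();                          // the clone may run on another worker
                return right.compute() + left.join(); // joining = merging the clones' results
            }

            public static void main(String[] args) {
                System.out.println(new ForkJoinPool().invoke(new CloneSplitJob(0, 1_000_000)));
            }
        }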

    1.4. Possible Usage of Agents in the Comcute System

    The above sections show that multi agent systems can be successfully used in large-scale distributed systems. One such system is the Comcute grid, still under development. More precisely, it is a grid system using the computational power of volunteers [2]. It is characterized by high reliability requirements and easy expansion by attaching new volunteers. In order to meet these requirements, the distribution layer is divided into a number of equal peers instead of using one central server. The system as a whole is divided into four layers:

    • Z — layer of the service requester,
    • W — contains a number of servers responsible for dividing tasks and for gathering and verifying results,
    • S — proxy servers between W and I layers, responsible for communication with volunteers,
    • I — the layer containing volunteer machines available for performing computations.

    There are some aspects where agents can be introduced in order to achieve new functionality or possibly improve performance. Typically, in volunteer applications where peers are not autonomous, there is no data size negotiation: the task divider sends a specified amount of data and adjusts the sizes of subsequent parts on the basis of the volunteer’s computation speed. If there were an agent in the I layer representing a particular volunteer and an agent in the S layer representing the task divider, there would be room for data size negotiation depending on the volunteer’s system load and on how much of its performance it is willing to share. The agent would perceive its environment (the volunteer’s machine) and its owner’s performance preferences in order to tune the amount of data received from the S layer. A simple sketch of such adaptive sizing is given below.
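
    A sketch of such negotiated sizing: the I-layer agent reports its machine load and the share it is willing to give, and the S-layer agent scales the next data package accordingly. The formula and the bounds are our own illustration, not part of the Comcute design.

        public class ChunkNegotiator {
            private static final int MIN_CHUNK = 1_000;     // illustrative bounds
            private static final int MAX_CHUNK = 1_000_000;

            // systemLoad: 0.0 (idle) .. 1.0 (fully busy), reported by the I-layer agent.
            // willingShare: the fraction of performance the volunteer agreed to share.
            public static int nextChunkSize(int baseSize, double systemLoad, double willingShare) {
                double factor = willingShare * (1.0 - systemLoad); // less data when busy
                int size = (int) Math.round(baseSize * factor);
                return Math.max(MIN_CHUNK, Math.min(MAX_CHUNK, size));
            }

            public static void main(String[] args) {
                // A volunteer at 30% load, willing to share half of its performance:
                System.out.println(nextChunkSize(200_000, 0.3, 0.5)); // prints 70000
            }
        }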

    Implementing agents would also help in introducing paid services. This would require agents in all the layers. Let us say that a requester from the Z layer is willing to pay some amount of money for performing computations in the grid system, and the grid owner in turn is willing to pay the volunteers for sharing their resources. First, an agent from the Z layer negotiates with one from the W layer in order to decide what needs to be done, with what performance (how fast) and at what cost; an SLA containing these QoS parameters is concluded. Then agents from the S layer negotiate with those from the I layer to establish an SLA concerning how much of the volunteer’s performance will be shared and what the fee will be.

    In most volunteer systems there is the problem that a particular volunteer can disconnect at any time. As a result, the same data is processed by different peers in order to provide reliability. If fees for sharing computation power were introduced, the SLA could also establish penalties for disconnecting in the middle of computations on a particular data package. This could improve the efficiency of the whole system by reducing the number of backup computations.

    In the most elaborate form, agents would be introduced in all four layers, but negotiations would be carried out only within the Z–W and S–I layer pairs. Because layers W and S belong to the inner system, there is no need for negotiations between them; that does not mean, however, that there is no place for an SLA providing QoS parameters there.

    When implementing agents, the use of standards should be considered. The most commonly used is the FIPA standard, thanks to its standardization of communication between agents. It allows all the required communicative acts to be expressed: requests, queries about an agent’s inner state and negotiations (contract net).

    References

    1. Alain Andrieux, Karl Czajkowski, Asit Dan, Kate Keahey, Heiko Ludwig, Toshiyuki Nakata, Jim Pruyne, John Rofrano, Steve Tuecke, and Ming Xu. Web Services Agreement Specification (WS-Agreement). https://forge.gridforum.org/projects/graap-wg/, 2012-04-02.
    2. Jerzy Balicki and Jarosław Kuchta. Distributed Computations in Computer Systems with a Grid-Class Architecture (in Polish). Wydawnictwo Politechniki Gdańskiej, Gdańsk, 2012.
    3. FIPA – Foundation for Intelligent Physical Agents. Standard Status Specifications. http://www.fipa.org/repository/standardspecs.html.
    4. FIPA – Foundation for Intelligent Physical Agents. FIPA Abstract Architecture Specification, December 2002.
    5. FIPA – Foundation for Intelligent Physical Agents. FIPA Communicative Act Library Specification, December 2002.
    6. FIPA – Foundation for Intelligent Physical Agents. FIPA Message Structure Specification, December 2002.
    7. FIPA – Foundation for Intelligent Physical Agents. FIPA SL Content Language Specification, December 2002.
    8. Ian Foster, Nicholas R. Jennings, and Carl Kesselman. Brain meets brawn: Why grid and agents need each other. In Proceedings of the 2005 Conference on Towards the Learning Grid: Advances in Human Learning Services, pages 28–40, Amsterdam, The Netherlands, 2005. IOS Press.
    9. Ian Foster, Carl Kesselman, and Steven Tuecke. The anatomy of the grid: Enabling scalable virtual organizations. Int. J. High Perform. Comput. Appl., 15(3):200–222, August 2001.
    10. Qiang He, Jun Yan, Ryszard Kowalczyk, Hai Jin, and Yun Yang. Lifetime service level agreement management with autonomous agents for services provision. Inf. Sci., 179(15):2591–2605, July 2009.
    11. Bart Jacob, Luis Ferreira, Norbert Bieberstein, Candice Gilzean, Jean-Yves Girard, Roman Strachowski, and Seong (Steve) Yu. Enabling applications for grid computing with globus. IBM Corp., Riverton, NJ, USA, first edition, 2003.
    12. Carl Kesselman and Ian Foster. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, November 1998.
    13. Jiewen Luo and Zhongzhi Shi. Distributed system integration in agent grid collaborative environment. In Integration Technology, 2007. ICIT ’07. IEEE International Conference on, pages 373–378, March 2007.
    14. Frank Manola and Craig Thompson. Characterizing the Agent Grid. Technical Report 990623, Object Services and Consulting, Inc., June 1999.
    15. Eric Newcomer and Greg Lomow. Understanding SOA with Web Services (Independent Technology Guides). Addison-Wesley Professional, December 2004.
    16. OASIS. Reference Model for Service Oriented Architecture. http://www.oasis-open.org/committees/download.php/16587/wd-soa-rm-cd1ED.pdf.
    17. Luis F. G. Sarmenta. Volunteer Computing. PhD thesis, Massachusetts Institute of Technology, 2001.
    18. Mehrdad Senobari, Michal Drozdowicz, Marcin Paprzycki, Wojciech Kuranowski, Maria Ganzha, Richard Olejnik, and Ivan Lirkov. Combining a JADE-agent-based grid infrastructure with the Globus middleware – initial solution. In Proceedings of the 2008 International Conference on Computational Intelligence for Modelling Control & Automation, CIMCA ’08, pages 895–900, Washington, DC, USA, 2008. IEEE Computer Society.
    19. Telecom Italia Lab. Java Agent DEvelopment Framework Documentation. http://jade.tilab.com/doc/index.html.
    20. Michael Wooldridge. Introduction to MultiAgent Systems. John Wiley & Sons, June 2002.
    21. Michael J. Wooldridge and Nicholas R. Jennings. Intelligent agents: Theory and practice. The Knowledge Engineering Review, 10(2):115–152, 1995.
    22. Zhikun Zhao, Feng Yang, and Yinglei Xu. Building P2P Volunteer Computing System Using Agents. In Computational Intelligence and Software Engineering, 2009. CiSE 2009. International Conference on, pages 1–5, December 2009.