ABSTRACT: Parallel processing in a computer enhances performance for high speed computing and it can be carried out by using many techniques and architectures at software and high hardware level. Performance optimization using hardware techniques may include the use of multiple computing nodes or a single node consisting of multiple processors. Symmetric multiprocessor is one of the modern architectures used to perform far-reaching computations. Symmetric multiprocessors have many configuration approaches to carry out these substantial computations. The performance of Symmetric multiprocessors is analysed and compared with high reliability models. Processor models are used to design and construct the architectures of symmetric multiprocessors. In this research paper, critical design aspects of symmetric multiprocessors have been analysed for further enhancement of the existing technology.

KEYWORDS: System Performance, Symmetric Multiprocessors, Clusters, Parallel processing, Process scheduling, Speedup, Cache Coherence

I. INTRODUCTION

The demand for the processing speed is growing at an incredible rate. The capability of execution according to speed and efficiency can be increased by different kind of ways like enhancing the CPU programming like inserting new programming, arrange new registers to the model of microprocessors and grouping up the CPUs[9]. Chip improvements are required for the first two options but the third can boldly increase the processing power. However, the “CPU grouping” approach is affordable because:
- If we enhance the CPU programming, more efforts would be required to integrate the programs and registers.
- If one processor is faulty, the life of the computer would be increased by multi processors.

Hence, we have a choice, to rely on internal changes of the CPU or we combine multiple processors/CPUs. Symmetric multiprocessing is a case of parallel multiprocesing [5], [11]. The symmetric multiprocessing system consists of multiple similar processors within the same computer. The processors are interconnected by a bus or some type of switching mechanism. Each processor has access to a common memory including its own cache. All processors behave identically and the kernel of operating system can assign any process to any processor. A Single instance of the operating system manages all processors. Applications have uniform access to memory and I/O. These operating systems are more special and complex unlike typical operating systems.
multidimensional. The scheduler allocates processes to the CPUs to execute it. This complicates the processing paths and signals of multiprocessors. Thus, efficient multiprogramming is required to avail the full and maximum processing. Symmetric processors have their own front side bus that’s why they have the advantage over cores.

The scalability of symmetric multiprocessors can be increased by using mesh architecture. SMP is one of the earliest types of computer architecture and mostly used up to 8 processors. These multiprocessors share a common main memory and I/O. A microcontroller controls data flow throughout the processors and main memory [12]. Each processor has a dedicated cache for better latency and data brought into each processor’s registers can be transferred through its cache rather than from main memory. The question arises here that may be a process on data can be cached by multiple processors. To avoid this incidence there is an update policy called cache coherence that ensures each processor is working on recent copy of data. The basic architecture we use for coherence is snoopy bus architecture (discussed later).

II. SMP CLUSTERS

A cluster is a group of interconnected whole or complete computers working together as a unified computing resource and can create the illusion of being a single more powerful machine. Each complete computer can run on its own, apart from the cluster. In other words a computer cluster is a team of linked computers forming fast local networks. Clusters are usually used to increase the speed and performance over single computer. They are cost effective and available to high performance computing. They are operated on having redundant CPU nodes [12], [8]. The capability to control more clients by giving more jobs and data access is done by scaling server side processor. We can have cluster of shared memory like: Each node has its own local memory, and nodes share data by passing data over the network. Client computers bonded to clusters of SMP server have given the computing power of Divide and Conquer Algorithms. Clusters provide the scaling of I/O, processors, and storage but not of client management methods, or security. Grid’s computing domain is scaling. Grid computing provides services like client management or security. With some careful analysis on SMP nodes and cluster architecture, we can scale these systems precisely and with very limited waste of resources. Also the communication between cluster nodes is much greater than that of between multiprocessing in SMP [12].

A. SMP Software features

In multiprocessors all processors use the same order of rules just like in a single processor system. Factors that may hinder speedup in a multiprocessor system include, among other, conflict over memory access, conflict over communication paths, inefficient algorithm for implementing the concurrency of the processors. The thread which comes out of this rule of processing, are globally analyzed and reordered with respect to each other in the shared area [4]. The thread visits the processors as multithread. So multiprogramming is required to produce multi threads. If the operating system does not partition the threads in multi-way multiprocessors, then it is better to use a uni-processor instead. In fact it could be a worse situation, because it may suffer more locking overheads and process delays when dispatched to other processors, it may be slower. There are different ways to achieve parallel threads execution of a single program:

i. Make parallel calls to library subroutines to create parallel multi threads that can run at a same instance.

ii. Execute the program with a parallelizing compiler. It will help us to detect threads that are not dependent on other thread that is to be executed on second instance and generate a parallel multithreaded parsing code.

iii. Use multithreaded software.

The maximum improvement can be analyzed and achieved by a rule that is called Amdahl’s Law: It says increase of speed can be achieved by a formula equals to uni-processor time
divided with the sum of sequence time and
time of multi-processor.
A parallel program has a sequential part and a
parallel part, the proportion for both of them
are \( \beta \) and \((1 - \beta)\)
Assume the total execution time for a single
processor is \( T = \beta T_1 + (1 - \beta) T_1 \)
The total execution time for \( p \) processors
would be
\[
T_p = \frac{\beta T_1 + (1 - \beta) T_1}{p}
\]
Speedup (\( p \)) = \[
\frac{1}{\beta + \frac{(1 - \beta)}{p}} = \frac{p}{\beta p + 1 - \beta}
\]
\( \beta \)
If \( f \) is an inherently sequential fraction of
execution time, then
\[
\text{Speedup} \leq \frac{\beta}{f + \frac{(1 - f)}{p}}
\]
Corollary: Maximum speedup = \( 1/f \)

III. OPERATING SYSTEM
SCHEDULING

A SMP unit has a similar view of the memory;
any task has the capability of running on any
processor. But in fact, it is not a correct way to
let a task wandering between different
processors to be processed [9]. When a thread
migrates to a processor, the data which is
currently present in first cache of that processor
also has to be moved towards the other
processor. So each processor can have its copy
and can be replaced again on the memory. This
is handled by the cache coherency. If we
introduce the processor property to each task
and bound that task to execute on the same
processor, this method is known as processor
affinity. But this method is not suitable for
locking because for example a task is running
on that processor and suddenly blocked. The
task which is already waiting in a queue and
ready for execution would suffer extra time [9],
[10]. Therefore modern SMPs can assign any
task to any processor. In Windows NT there are
no separate schedulers. A thread produces
events. These events are handed over to the
event handler modules of the windows kernel.
Events Like creating a new task, task asleep,
blocking of tasks on synchronization and task
terminator. Windows scheduling is totally
based on the time quantum mechanism. Each
task has a time period. It is a time in which
operating system checks the task priorities. To
do task switching there is time tick period. The
tick is custom set to 10 ms for uni-processors,
and 14 ms for SMP. On single tick, time is
decreased by 3. When time period value
reaches zero, task will be put away and recalled
again soon.

A. Seven States Model for OS
Processes/Threads

Normally, Windows NT has “32” priorities.
"0" priority is for idle process. Windows may
keep on changing priority to avoid starvation,
indefinite postponement deadlock etc. The OS
kernel maintains 1 queue ready for each thread
priority. There is a bit mask of (32 bit) which
tells which task is ready to be performed and if
its idle it tells scheduler that processes are idle.
If none of any processor is found idle, the
scheduler will preempt the lower priority task
on interrupt. Every processor has assigned
each task and the last processor on which it was
executed is saved. Another way can be that we
do process switching but the fact is process
switching costs more than thread switching. So
it’s better to divide the threads and allocate to
the multiple processor.
this for operations like Exchange, Compare and Addition etc.
A programmer must know how many locks should be created. For example if spin lock is created: A spinlock must not be recursive, as the processor would be continuously spinning on the lock with no one to release the lock. Too much poll will affect bandwidth and too low will delay. So remain balanced. The existence of multiple locks makes a deadlock possible. More than needed locks effects throughput. For example in case of mutual- exclusion locks, with 9 instances of a program running in parallel, the 9 instances would not be synchronized effectively to avoid waiting for other process.

IV. CACHE COHERENCY WITH NON–UNIFORM MEMORY ACCESS (CC-NUMA)
In symmetric multiprocessors every processor has its own cache, so the obvious possibility is that every cache has the same copy of data to be executed. If more than two threads modify the same data, it concludes with no data coherency [3]. The solution is to invalidate other copies of data except one, by broadcasting on the shared bus. Invalidation is performed by cache controller hardware [13]. Cache controller hardware watches flow of data over the bus. This method is known as snooping protocol. Directory based coherence protocol is another model [4], [7]. After invalidation, there are two methods to update main memory [10]:
Write through: In this method the controller updates the memory as soon as possible for the other processors after writing the data. Write back: In this method the controller doesn’t updates the memory cell unless another thread comes and demand for that cell. If one processor has demanded the same data in the memory, it is better to retrieve it from the cache of other processor. Main memory will take more time to recognize the recall.
Cache Coherency Solutions There are three states of cache blocks of Snoopy protocol: Shared (block is ready to fetch and read), Exclusive (block is ready to write and there are no other copies of it), Invalid (block has no data) [15]. These states are implemented when CPU demands for any cache block.

A. MESI Protocol
The snoopy protocol model is ideal for on chip supported caches. But in most of the small scale SMP’s MESI protocol model has been implemented. The MESI protocol is an example of scheme that employs snooping. There are four states of MESI protocol: INVALID, SHARED, MODIFIED & EXCLUSIVE.
B. Token Coherency Protocol

Since the message passing technique was difficult in direct connections, a new protocol model was designed for direct processor interconnections as well as for switched based interconnections in 2003. Token coherency technique simply uses counting and exchange of tokens. Each block is mapped with fixed number of tokens. Processor should have all the tokens in order to write a block but to read a block at least 1 token is required [1]. Encoding bits of tokens is done by the formula Log2N, where N = no. of tokens.

In this protocol processors can predict and ask other processor for the required token if it has. Token coherence model can perform 20 - 30% faster than the snoopy protocol. Counting of tokens gives safety to coherence invariant - single writer and multiple readers. If a processor failed to acquire data, a timeout message will be sent to requestor, and a persistent request will be sent. Request persists until it is satisfied and deactivated upon completion. It reduces starvation as well. Token snooping is more efficient in direct processors interconnection. The graph given below is between indirect interconnected SMP on Normalized runtimes.

C. System Scalability and performance

A parallel computer with n processors under normal circumstance should be n time faster than a single processor system. However, this is not always the case. The speedup can be much less than n. Scalability is the performance of Multiprocessors. As expected, adding more processors should increase the overall performance accordingly, for example two processors should increase the performance twice as compared to one processor. But in actual fact, adding processors increase the scalability so it reduces some as well. That is due to:

- Conflict over memory access
- Conflict over communication paths
- Inefficient algorithms for implementing the concurrency of the processors
- Cache coherency pipelining
- Time taken by number of cycles by spin lock
- Synchronization conflict

The fact is, if one processor provides 1 speed. Two processors provides 1.75 with an increase of 0.85 and eight processors will provide 5.2 - 5.5, according to Amdahl’s law, the speed and performance gradually decreases but not a big deal as compared to uni-processors. Thus scalability and performance can be increased by:

- Increasing memory band-width,
- Shortening the latency of memory access time (maybe we should design anew memory scheduling techniques and algorithms), and
- Reducing the gap between memory so that starvation could be less possible.

The Threads received by SMP is distributed to all processors equally. If we run Windows NT on a single CPU, as we know that threads are distributed by multithreaded operating system. The result of parallel threads can be shown by this graph.

![Fig. 7: Threads Behaviour on a Single Processor](image_url)
V. SMP Network Architecture

A. NUMA Architecture

SMP is a basic form of UMA (uniform memory access) architecture. The interconnection of SMP with other SMPs through Interconnection network(s) and switches forming the clusters are known as NUMA (non-uniform memory access). UMA is best for not more than 8 processors due to scalability issues. But NUMA or (Cache Coherence NUMA) makes it preferable due to its scalability for more than 8 processors, because every unit of processors have their own local physical memory which is easy to access but logically there is one address shared space.

Users prefer NUMA on multiprocessor architecture because they believe that programming is easier and due to non-required extra library of the compilers. The time required to access data depends upon its location, whether it is present in local memory or may be residing over remote memory. A single image of operating system runs all over the network. If one processor will modify any data the other processors will also update data into their cache.

Logical address space contains pages; these pages have some states which are passed to track the position.
- NO-PRESENCE: they are in remote memory
- SHARED: copies are distributed to local memories
- EXCLUSIVE: In local memory”

The latency for accessing data in comparison of both local memory as well as non-local memory is calculated by NUMA Factor. For example data access from the neighbour node is faster than the node which is present on a distant level. For NUMA factor we have to engineer different inter-connection design for the nodes. For example: Indirect fat tree, 2D torus, 4D hyper cube, Hierarchical Interconnection, Omega network, Ring, cross bar ... etc.

B. COMA Architecture

COMA (Cache Only Memory Architecture) is made for large SMP networks just like NUMA. In this structure memory are replaced by cache memory or we can say acting like cache (attraction memory). Their addresses are hashed to DRAM cache lines. Data is readable at any of modulo at any single instance and is moved by hardware. This ability of making copies, proved this structure extremely time effective. If the OS algorithms are poor, COMA compensates it, but it requires separate memory boards along with coherence interconnection memory board.

VI. CONCLUSION

Parallel and distributed computing has grown affectedly the last few decades to achieve the emerging requirements of computational far-reaching applications. There are different techniques to implement distributed computing environments. The most prominent method is to use symmetric multiprocessors architecture for achieving high performance distributed platforms. Symmetric multiprocessor architectures have an extensive ability to manage multiple real time threads for active applications. Although the current architectural scenarios being adopted for the design of symmetric multiprocessor architectures are demonstrating enough computational power, there is room for improvement through further research. Different issues including synchronization, pipelining, protocols, hyper threading, coupling of GPU and SMP and dealing with SMP clouds are all factors to consider for further research.
VII. REFERENCES


