Friday 9 May 2014

The End of University and Thoughts on the Future

Hey there, so now University has been officially completed! My honours project and viva voce have been completed, all exams are done and now it's just a matter of deciding what to do for the future. My marks and grades should come in the beginning of June and my graduation will be in July. In the meantime however I need to decide between two courses of action for the immediate future and that is either getting a job or post-graduate education. Currently I have received an offer for a funded PhD in Pervasive Parallelism at Edinburgh University which stands as a good contender, however on the opposing side getting a job in industry is an attractive option. Money and experience is always a good thing. As such these last couple of weeks have been a form of turmoil as I slingshot between the two but there's not much that can be done until I receive back all my formal offers. As such I shall post an update in the coming weeks.

However, in the meantime, now that university is over my goal is to continue honing the skills I've picked up over my time here. Alongside this I aim to finish many of the side projects that I haven't had time for over the past few weeks not to mention all the unplayed games in my steam library that need finishing. Currently I'm in the process of writing a neural network library, the goal of which is to make an optimised package that will integrate into your C++ project with ease to provide automatic fast neural networks. All you will have to deal with on your end is deciding network topologies and normalising the input data. It's nearly complete but most of the trouble is getting my head around the neural networks in the first place and making sure the classes actually work. I'll throw up a link to the github repo once I get version 1.0 working properly.

Anyways, apart from these things I'm back to writing and game design as well and will be looking into Unreal Engine 4 stuff as I'm hoping to learn that over the next few weeks. Not much else to say really other than it's been a fun few years at university but I'm looking forward to moving on to bigger and better things! 


Email: markmmiller@hotmail.co.uk
Xbox Live: Dr Death MK 2
Steam: 7thsanctum

Origin: 7thsanctum
Youtube: 7thsanctum
Github: 7thsanctum

Thursday 8 May 2014

Let's Play - Shadowrun Returns

What is it?
Shadowrun Returns is a single player role playing game with tactical turn based combat elements made by Harebrained Schemes. It's also a Kickstarter funded game released in mid-2013 based upon the Shadowrun table-top boardgame. Whilst it include a campaign, Dead Mans Switch, there is also support for the Steam workshop and comes with custom mission tools so that users can share and play each others campaigns. 



What is it about?
The main campaign included follows the story of your character investigating the mysterious death of a friend and trying to track down the killer in 2054 Seattle. The characters motivations are a combination of honour, greed and nothing better to do. The Dead Mans Switch story lasts around 11 hours and presents a good and well polished story with many interesting characters and conversation options. Whilst this included campaign is interesting and enjoyable, giving back a semblance of old noir style detective stories, the real gem is in the use of the Steam Workshop and included mission building tools to allow sharing of stories.



Combat in the game takes place in a turn based manner, similar to X-Com. The primary skills that are used are take the form of ranged, close and magical combat and are used dependent on your characters abilities. As such, a shaman is more likely to use summoning or magical attacks to fight enemies. 


However, forming teams of diverse runners is important for victory. Shamans can often boost accuracy and provide other shield bonuses for team-mates, whilst deckers can hack into the matrix to override sentry guns and other systems. Alongside this, deckers engage in combat with AI systems and other deckers within the matrix and combat takes part in a similar manner to the physical world. 

As the player levels up, using karma, they can upgrade their stats to provide combat bonuses in different skills. These include the ability to use advanced decking equipment, improved weapon accuracy, better conversation options and more. 


Where can I get it?
Shadowrun Returns is available for PC(Mac and Windows), iOS, Android and Linux. Currently it is available on the Steam and Humble store pages. There is also a sequel out called Dragonfall which is available as DLC. Extra campaigns and stories are available via the steam workshop. I am currently playing through Shadowrun Unlimited which provides a very interesting campaign so far and can be found here.


What do you think?
Overall Shadowrun Returns is a very interesting RPG game, the included campaign is good and provides enough to whet your appetite however the inclusion of steam workshop and custom content support is what makes this game worthwhile. Already campaigns such as Shadowrun Unlimited and Jacked-Up have provided extended hours of gameplay and interesting and novel additions to gameplay functionality such as safe houses. Although this is the case, it would've been nice if more official campaigns were included with the base game.


The game itself follows closely to the original Shadowrun lore and ties the campaign into the mythos aided by the fact that Jordan Weisman, creator of the original Shadowrun started this project. This merger of cyberpunk and fantasy also makes for a very interesting setting which I personally enjoy a lot. Apart from this, the in-game artwork and background environments are beautiful and further emphasise the dystopian nature of the setting. The lack of voice recordings is offset by the (for the most part) well written story and appropriate soundtrack.

Rating : 8 / 10



Email: markmmiller@hotmail.co.uk
Xbox Live: Dr Death MK 2
Steam: 7thsanctum

Origin: 7thsanctum
Youtube: 7thsanctum
Github: 7thsanctum

Monday 14 April 2014

Honours Project - Poster Presentation, Dissertation Writing and the Final Push!

My honours project and final year of my bachelors degree are coming to a close. The final submission data is the 28th of April but my goal is to have everything finished up a tad earlier than that. Hopefully by the 21st everything shall be finished and prepared for printing, then a few exams and finally out into the big bad world of post-graduate employment or further education!

Anyways, this post is a quick update of the final submitted status of the project. I won't have time to add more in before the projects finale but will no doubt continue to tinker and improve it over summer. Firstly, the poster.

Figure 1 - Final Poster! (Click to Enlarge)
So I presented this poster last week at the poster presentation sessions and, for the most part, everyone understood the purpose and goals of the project. As for the production of the final project, a fair amount of progress was made and the results were also fair. Further improvements could have been gained from implementing sparse voxel octrees, however this is something I'll have to add in the future.

 Some of the interesting images and effects that were gathered over the project can be seen below.

Figure 2 - Simplex Noise Data
 The above blobby terrain was a simplex noise structure, rendered, in this case at 128^3 resolution.

Figure 3 - Floating Rock
One of the more interesting structures and prime inspirations for this project was the floating rock structure made by Florian Boesch in this post here
Figure 4 - Perlin Noise Landscape
Perlin noise was also implemented to generate landscapes, much like in games such as Minecraft.

Figure 5 - Cut out of Sphere
A few other interesting effects were found, included after the optimizations of certain structures such as the sphere, which created interesting geometric patterns on the inside.

The main purpose for this project was the use and implementation of geometry shaders to prove their effectiveness for a task such as this. Whilst the engine produced worked well and could render things efficiently, a more scientific method would be required to prove how effective it is in comparison to other techniques.

Anyways, the main task for just now is getting the dissertation sorted and writing up about all the findings. On an unrelated note, saw this awesome project that generates colourful triangle based backgrounds using D3. Time to get back to the grind...


Email: markmmiller@hotmail.co.uk
Xbox Live: Dr Death MK 2
Steam: 7thsanctum

Origin: 7thsanctum
Youtube: 7thsanctum
Github: 7thsanctum

Tuesday 11 March 2014

Honours Project Update - DirectX

Hey again, it's been a while since I've written about the status of my honours project and I've added a few features over the past few weeks.

I've added in some new data to give more interesting renderings compared with the old solid rainbow block. The main ones are a perlin noise heightmap using libnoise and a simplex noise dataset. Click on the pictures to read the text!
Figure 1 - Heightmap (Perlin Noise)

In the screenshot below, approx 1.5 million cubes are being drawn at above 40fps consistantly.

Figure 2 - Simplex Noise (192^3)
I have also added multi-texturing but still need to modify my application to make better use of the type distribution. This allows the engine to currently render different block types with different textures. 


Currently, the optimisations I have in place only remove inactive and fully surrounded cubes from being rendered, because of this I get a good performance for fixed datasets that do not change. However to update it or perform naive frustum culling it would slow it down considerably as my checks do this against every cell in the system. My next step is to fix this using the power of octrees! This is why in the gif above the number of cubes being rendered does not decrease as the camera moves. All 1.5 million of these points are being sent to the GPU and drawn into cubes using the geometry shader! 

A couple of other minor things I've added to my engine, frustum culling (although I need to combine this with multi-threading and octrees before it becomes effective), back face culling in the geometry shader (which has reduced the number of vertices I need by half), text rendering and extra controls. 

Main things I need to work on now are my octree implementation and improvements to my lighting model, I also need to gather more concrete performance data from other engines to find out whether or not I'm managing to improve on anything. However, I'm confident my use of the Geometry shader should provide similar performance benefits, if not greater, to the usage of vertex buffer objects once I manage to get octrees implemented. The geometry shader is a powerful stage and has allowed me to render over 11 million cubes without optimisation. Granted this is at a measly 10fps on a GTX Titan, but it does show that with the correct optimisation in place it can render large datasets well. The task now is filtering the data before it reaches the GPU. The dataset below is a solid 224^3 cube with all cells active and all cells being drawn, no filtering is performed.

Figure 3 - Current Max number of cubes (11 Million)


Anyways check out the video below to see a slightly older version of the application with consistently updating heightmaps.



Email: markmmiller@hotmail.co.uk
Xbox Live: Dr Death MK 2
Steam: 7thsanctum

Origin: 7thsanctum
Youtube: 7thsanctum
Github: 7thsanctum

Tuesday 11 February 2014

GTX Titan Upgrade, Dogecoin and a New Domain Name!

A couple of interesting things, firstly I have registered the markmmiller.co.uk domain name so this blog can now be found at blog.markmmiller.co.uk. As well as this my main site is still a work in progress but I have been experimenting with Linux and have set up a Raspberry Pi to act as a web server for hosting. Again the main site at www.markmmiller.co.uk but if you head over the site is being hosted from my Raspberry Pi as we speak! 

In other news I applied for the Nvidia Academic hardware donation program a few weeks back for a GPU to help with my research into parallel programming and to much surprise I managed to get approved for a GTX Titan. The Titan finally arrived today and, for once, managed to install it without too much bother.

Figure 1 - GTX Titan

Figure 2 - GTX Titan
This was a significant upgrade to my ageing GTX 460 so hopefully I will be able to levy a lot more performance in my applications and I will attempt to benchmark both give a comparison of the performance gap between the two. Although the Titan isn't a mid-range card, hopefully it is an indication of the levels of performance that will become mainstream in years to come.

Figure 3 - MSI GTX 460
The other thing I have recently embarked upon is altcoin mining, mainly Dogecoin for now. With my GTX 460 I managed to achieve a 120 K/hash rate but the Titan on the other hand doesn't work with my app so I shall update when I find out. As for the temperature, both the Hawk and Titan are running around the same temperature idle but currently the Titan is having trouble turning its fans on when it gets hot. I will look into this later. Anyways, I shall get to bench marking soon but first is a matter of installing the programs. Check out the photos below to see what it's like in the case!

Figure 4 - Case Interior

Figure 5 - Titan Closeup
Be sure to click on the above photos to see them in HD. Anyways, back to more University and Honours project work for me, I'll update you guys soon!

Email: markmmiller@hotmail.co.uk
Xbox Live: Dr Death MK 2
Steam: 7thsanctum

Origin: 7thsanctum
Youtube: 7thsanctum
Github: 7thsanctum

Thursday 30 January 2014

Global Game Jam 2014 - Face/Off and lessons learned

Over the weekend  I participated in Global Game Jam at the Edinburgh jam site. The goal was to make a game in 48 hours, starting at 5pm on the Friday with a deadline of 5pm on Sunday. The secret theme this year was a quote from Anais Nin, "We don't see things as they are, we see them as we are". After the theme was revealed we had some time to band together with everyone else at the event and come up with some pretty interesting and wacky ideas. Some people decided to make platformers where everything in the world was not as it seemed, others worked on an Oculus Rift horror game and several awesome ideas were thought up and discarded quite early on. 

One of the ideas I pitched was a game where you play in a standard FPS view against other players and the goal is to eliminate the other players. The main feature though was that at the start of the game the views of all the cameras are switched so that you would then be looking at another players view but you would still be in control of your own body. Through this 2nd person view point you then had to kill the person who had your eyes to retrieve them. Although this idea sounded quite interesting at the time it proved to be very complicated to play initially for many of the play testers. The gameplay itself when it worked was very good, mainly when two players who were trying to eliminate each other were looking at themselves and trying to take out their own view port. Sadly more often than not players would end up wedged in walls and completely stuck.

In the image below, Player 1 would be looking at the top left quadrant which is the camera view of another player. Their objective is to then work out which of the other bodies they are controlling and finally eliminate the player whose eyes they are looking out of.
Figure 1 - Original Game Idea

On the final night however we decided to add some alternative experimental game modes as we managed to get the initial game completed quite early. The first of these was a crowd hunting game, you play as a robot, mingling hiding in a crowd of other robots and your job is to find the other human player in the game who is also a robot and is also attempting to hunt you! This idea came about from one of the original ideas we had for the game which was that of crowds of NPC robots making it more difficult to spot yourself. After implementing them however we realised that it was actually more fun trying to spot others at all when people were trying to hide and blend in with the robots who were in the crowd.

Figure 2 - In-game Screenshot
Overall many lessons were learned over the course of the jam. The main one was that humans cannot survive  (and functional optimally) for 48 hours on purely chocolate and water. The other was that, it doesn't matter if your idea changes from what it was initially and new ideas that you come up with along the way shouldn't necessarily be restricted.

Anyways, although what my team and I made was not the most polished looking game it was a pretty interesting concept that we managed to make up in a relatively short period of time. Check out the video below to see a game in action.


Credits:
Mark Melville Miller (7thsanctum)
Sam Serrels (dooglz)
David Strachan (Halcyon)
Kenneth Benzie (infektor)

Our game jam page can be found here : http://globalgamejam.org/2014/games/face


Email: markmmiller@hotmail.co.uk
Xbox Live: Dr Death MK 2
Steam: 7thsanctum

Origin: 7thsanctum
Youtube: 7thsanctum
Github: 7thsanctum

Saturday 18 January 2014

Distributed Parallelism for N-body using MPI, C++11 threads and SIMD

Over the final few weeks of the last trimester, for one of my modules I have been doing research into different methods of parallelism, how effective they are on performance and generally whether or not I can get much of a speed up with them. This post will discuss the algorithm used, optimisations performed and the specific results into this test. Overall it was found that combining multiple different paralellisation together (MPI & C++ threads & SIMD) produced the fastest speed up although this was not nearly as fast as the GPU compute shader version that was tested.

If you like percentages and data this is the post for you. I will make a post later and link to relevent articles on how I went about optimising these different versions another day.

Algorithm


The algorithm I tested was a simple N-body with gravity. The serial version that was used as a baseline and tested against the other versions of the applications was based on the version taken from the DirectX SDK and as such uses certain DirectX data structures although these are unnecessary. The algorithm is as follows.

// Body to body interaction, acceleration of the particle at 
// position bi is updated
void bodyBodyInteraction(D3DXVECTOR3& ai, D3DXVECTOR4& bj, D3DXVECTOR4& bi, 
                         float mass, int particles ) 
{
 // Get the vector between the two points
 D3DXVECTOR3 r = D3DXVECTOR3(bj.x - bi.x, bj.y - bi.y, bj.z - bi.z);

 float distSqr = dot(r, r); // Find the dot product of the vector
 distSqr += softeningSquared;

 float invDist = 1.0f / sqrt(distSqr);
 float invDistCube =  invDist * invDist * invDist;
 float s = mass * invDistCube * particles;

 // Update the acceleration with the calculated value
 ai = D3DXVECTOR3(ai.x + (r.x * s), ai.y + (r.y * s), ai.z + (r.z * s));
}
// Call this function to calculate the next position of all the bodies
// in the system, represents a single pass
void calculateGravity()
{
 // Loop through every particle in the system
 for(int i = 0; i < MAX_PARTICLES; i++)
 {
  D3DXVECTOR4 pos = pData1[i].pos; // Stash a local copy of
  D3DXVECTOR4 vel = pData1[i].velo; // the position and velocity
  D3DXVECTOR3 accel = D3DXVECTOR3(0, 0, 0);
  float mass = g_fParticleMass;

  // Calculate acceleration to be applied to the
  // using all the other particles in the system
  for (int tile = 0; tile < MAX_PARTICLES; tile++)
   bodyBodyInteraction(accel, pData1[tile].pos, pos, mass, 1);

  // Update the velocity and position of current particle using the 
  // acceleration computed above
  vel = D3DXVECTOR4( vel.x + accel.x * 0.1f, 
                                        vel.y + accel.y * 0.1f, 
                                        vel.z + accel.z * 0.1f, 
                                        vel.w);   
  vel *= 1;     // damping
  pos += vel * 0.1f;   // apply deltaTime 

  pData2[i].pos = pos; // Save the newly calculated position and velocity
  pData2[i].velo = D3DXVECTOR4(vel.x, vel.y, vel.z, length(accel));
 }

 pData1 = pData2; // Swap the buffers
}

The above algorithm represents that simple sequential version used for the tests, the data structures can be changed for arrays of floats, vectors or anything else but the underlying theory is the same. For each particle in the system, calculate the gravitational effect on it based upon every other body in the system. The bodyBodyInteraction function above calculates the acceleration effect upon one body based upon another, whilst the actual acceleration is applied after this has been calculated in the calculateGravity function. It is then applied and stored in pData2 which is a temporary list of all the new positions so that later particles in the list can still refer to the old positions. After all the calculations are complete we can then switch pData1 and pData2 and then loop the process ad infinitum.

Optimisations


I optimised this algorithm for multiple versions using C++11 threads, MPI and SIMD with a comparison against a version with DirectCompute shaders found in the DirectX SDK. Since this problem is primarily a data parallel problem I simply modified the algorithm slightly so that I could split up the data set to work across multiple cores or machines. SIMD required much more modification of data structures and instructions to properly function. The change can be seen below.

void calculateGravity(int start, int end)
{
 for(int i = start; i < end; i++)
 {

The algorithm will now work by specifying a range of values to work upon in the range of data. This is useful for the multi-threaded approach where thread 0 will do the first group of values, 0 - 255 for example, then thread 1 will do the second group, 256 - 511. This follows the parallel data and sequential processing model.

SIMD

SIMD (Single Instruction Multiple Data) by far required the biggest changes through the code, the main things that were changed were the use of the D3DX11VECTOR types which were changed to 32 byte  __m128 values instead. Luckily the position and velocity information for each particle fit perfectly into these types as they are also represented by 4 floats each. The other primary change was the use of SIMD operations. Since there were a large number of changes made more details on these changes will be outlined in another post.

// Body to body interaction, acceleration of the particle at position bi is updated
void bodyBodyInteraction(__m128& ai, __m128& bj, __m128& bi, float mass, int particles ) 
{
 __declspec(align(16)) __m128 r = _mm_sub_ps(bj, bi);
 
    float distSqr = dot(r, r);
    distSqr += softeningSquared;

    float invDist = 1.0f / sqrt(distSqr);
 float invDistCube =  invDist * invDist * invDist;
    
    auto s = _mm_set1_ps( mass * invDistCube * particles ); 

 ai = _mm_add_ps(ai, _mm_mul_ps(r, s));    
}

In the above code the distance between bodies is defined as "__declspec(align(16)) __m128" this means that we are using a memory alignment of 16 bits on our __m128 data type. This data type is used for the Streaming SIMD Extension (SSE) instructions.

The first instruction used in this section is "_mm_sub_ps(bj, bi)", this simply subtracts the values of bi from bj. With these SSE instructions we can perform the subtraction of each part at the same time on the CPU instead of one after the other. This should give us a theoretical speed up of four times for this single calculation.

The other instructions used in this section were "_mm_add_ps" which adds two __m128 together, "_mm_mul_ps" which multiplies two __m128s and "_mm_set1_ps" which simply sets all the values in an __m128 to the value defined.

By altering all of these vector operations to use SIMD we can make use of the hardware support on the CPU to speed up the time taken for a single operation. Whilst the speed gain for a single operation may be relatively negligible, we are doing them multiple thousands of times resulting in a compounding speed saving. It was interesting to note that despite both Intel and AMD supporting SSE instructions only the Intel CPU received a speed increase, the application ended up running slower on AMD hardware. This highlights some of the difficulties of using SIMD as you must take in account specifics of the CPU you are programming for. Memory alignment and how the individual CPU processes the instructions can result in differing level of speed increases. Improving the speed on the AMD machine was beyond the scope of this project but I have no doubt that it is possible.

C++11 Threads


For the C++11 threads we create the threads as follows.

// split up work for number of threads in system
for(int j = 0; j < num_threads; ++j)
{
 threads.push_back(thread([=](){ calculateGravity(start, end); }));
}
for(auto& t : threads)
 t.join();
pData1 = pData2; // Since it's multi-threaded we do the 
   // buffer switch after all threads are complete
}

The main changes are that now we must switch the data buffers after all threads have been rejoined as opposed to inside the function due to the parallel approach we have taken. The C++ threaded application simply splits up the data amongst each of the availible hardware threads for processing then rejoins at the end. As such the amount of speed increase is dependent upon the number of hardware threads available.

Since this application makes use of the number of cores on the CPU, this could be combined with the SIMD level parallelism and as such a version of this application was also produced taking into account the changes made by the SIMD application. It was found that this version did offer a speed up over the threaded or SIMD version alone.

I'll write a more detailed post about C++11 threading and how to do so and put a link here when it is done.

MPI

MPI (message passing interface) allows distributed applications to communicate across networks to each other. The main communication methods that were used for this application were "MPI_BROADCAST" and "MPI_SENDRECV" these two processes allowed the application to communicate the data across all machines and then have each machine send it back to a host machine.

Using MPI, multiple instances of the application can be running across different computers in a network, each one has an ID, with the host as 0. As such for these applications, all of the data was sent out to each machine on the network. Each machine that is not the host machine will wait until it has received its data before it goes on to process it, the host machine will then start sending the data to all the machines depending on the communication method being used. "MPI_BROADCAST" sends the data specified to all machines in the network and requires all of the nodes to be listening and call it at the same time. The other method is "MPI_SENDRECV" which acts as a point to point communication between two nodes. For the version of the application created I instead used "MPI_SEND" and "MPI_RECV" the distinction being that these are uni-directional. One process on the system will call "MPI_SEND" and the specified node must then at some point call "MPI_RECV", if either node is not ready to send or receive then that node will wait. This was used for the communication of the newly calculated positions back to the main node.

The code below shows the point that decides which part of the application is run by the node itself. "my_rank" is the nodes rank within the communication network, a rank of 0 belongs to the host node. As such, if you are not rank 0 then you are worker node, in this manner the stuff within the first if statement is performed only by the worker nodes. 

if (my_rank != 0)
{
   // Not main process - wait to receive message
   PARTICLE* message = new PARTICLE [ MAX_PARTICLES ];

   // Wait to receive data from host node
   MPI_Bcast(message, MAX_PARTICLES * 8, MPI_FLOAT, 0, MPI_COMM_WORLD);
   pData1 = message;
     
   calculateGravity(0 + (divide * my_rank), divide + (divide * my_rank));
   pData1 = pData2;

   message = new PARTICLE [divide];

   for(int k = 0; k < divide; ++k)
      message[k] = pData1[k + (divide * my_rank)];

   MPI_Send(message, divide * 8, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
}

The below else statement is what is performed by the host node. The primary differences are that it needs to initialize the particle system, broadcast the data out to all the nodes, calculate a portion of the data and the finally rejoin all of the calculated data from across the network.

else
{
   InitiliaseParticles();// Initilise the groups of particles
   
   // Send out the list of particles to all nodes
   MPI_Bcast(pData1, MAX_PARTICLES * 8, MPI_FLOAT, 0, MPI_COMM_WORLD);

   calculateGravity(0 + (divide * my_rank), divide + (divide * my_rank));

   pData1 = pData2;

   for(int k = 1; k < num_processes; ++k)
   {
      PARTICLE* message = new PARTICLE [ divide ];
      MPI_Recv(message, divide * 8, MPI_FLOAT, k, 0, 
               MPI_COMM_WORLD, MPI_STATUSES_IGNORE);

      for(int x = 0; x < divide; ++x) 
      pData1[x + (divide*k)] = message[x];
   }
}

The MPI version of the application was combined and tested with all other versions of the application. Since MPI is just a method of communicating between computers this meant that each computer itself could still benefit from the other parallelisation techniques to make use of the local hardware. Of course, in this instance the biggest performance factor was the speed of the network.

It was interesting to note that the application seriously slowed down for the smallest of data sizes, this was due to the transfer time of the data being higher than the time taken to process it on a single machine. As such it is important to take this into consideration when designing your application.

More details into MPI and their functions will be outlined in a post in future. 

Test Results

The main applications that were tested were as follows:
·        Sequential N-Body
·        Sequential N-Body – With SIMD
·        Parallel N-Body – C++11 Threads to split up data
·        Parallel N-Body - C++11 Threads (with SIMD)
·        Parallel N-Body – Compute Shader (GPU SIMD )
·        Parallel N-Body – MPI with Multiple computers (Sequential)
·        Parallel N-Body – MPI with Multiple computers (Parallel)

The applications were tested on two different systems, one Intel based and the other AMD.


CPU Name
Year
CPU Speed
Cores
GPU
NM
Cache
Computer 1 - Home
AMD Phenom II x6 1055T
April 27, 2010
3.25GHz
6
GTX 460
336 CUDA Cores
1GB GDDR5 @ 1800 MHz
Computer 2 – University
Intel i5 2500

Q1'11
3.3 GHz
4
GTX 550 Ti
192 CUDA Cores
1 GB GDD5 @ 1,026MHz

The cluster used for each of the MPI tests were combinations of 2, 4, 8 and 16 computers all with the same configuration as Computer 2 in the above table connected over a 100 Mbit network.

Sequential

The baseline speeds for each computer running the sequential version of the application was as follow:

Serial
1024
2048
4096
8192
16384
32768
65536
Computer 1 (ms)
78
328
1359
5484
22013
88429
357092
Computer 2 (ms)
67.75
250
992.75
3961.5
15839
63843.5
254323

As can be seen in the above results the differences between Computer 1 & 2 are minor and there is no difference between the two as the data size increases. The times from this algorithm will form the base case to compare against the other optimisations later on. It is interesting to note that Computer 1, despite having 6 cores, has no performance increase over Computer 2, this shows clearly that there is much room to make use of the parallel hardware available in the system.

For the following results, speed up shows how the speed of the parallel application ran compared with the sequential version. A speed up of 100% in this case 


Sequential with SIMD

The first optimisation that was tested was the SIMD version of the application. The test results and speed up versus the baseline version are shown below.

SIMD
1024
2048
4096
8192
16384
32768
65536
C1
78
328
1359
5484
22013
88429
357092
C1 (w/ SIMD)
93
375
1527
6077.75
23982
96716
387619
Speed Up
-16%
-13%
-11%
-10%
-8%
-10%
-9%
C2
67.75
250
992.75
3961.5
15839
63843.5
254323
C2 (w/ SIMD)
43.25
168.75
676.25
2694.5
10655
42562.5
170354
Speed Up
57%
48%
47%
47%
49%
50%
49%

As we can see from the above results, on the AMD based Computer 1 there was actually a reduction in performance when using these specific SIMD instructions. Computer 2 on the other hand received a moderate speed boost. Since SIMD in this case is performing 4 operations at the same time it is expected that the maximumm possible speed boost would be 4 times faster. This shows that more work is needed to realise the full potential of SIMD on the CPU for this application.


C++ Threads

These are the results for the application that used just threads by themselves.

C++ Threads
1024
2048
4096
8192
16384
32768
65536
C1
78
328
1359
5484
22013
88429
357092
C1 (C++ Threads)
15
62
265
953
3734
14968
59709
Speed Up
420% 
429% 
412% 
475% 
489% 
490% 
498% 
C2
67.75
250
992.75
3961.5
15839
63843.5
254323
C2 (C++ Threads)
19
80.75
278.75
1100.75
4411.75
17523
70323.8
Speed Up
257% 
209% 
256% 
259% 
259% 
264% 
261%

As we can see above, the AMD based CPU, Computer 1, received the biggest speed improvement and managed to surpass the Intel CPU which only had 4 cores. Both CPUs managed to achieve a much higher speed improvement much closer in line with the level of parallelism supported by each CPU. Computer 1 achieves nearly its ideal speed improvement of 6 times whilst computer achieves nearly 4 times.

C++ Threads (with SIMD)

These results relate to the threaded version that was also combined with SIMD to see if much more of a signifcant speed increase could be obtained. In the interests of keeping this article shorter the specific timing values have been omitted and can be found in the appendix.


C++ Threads
(with SIMD)
1024
2048
4096
8192
16384
32768
65536
Computer 1
Threads
Speed Up
420%
429%
412%
475%
489%
490%
498%
 Threads & SIMD
Speed Up
222%
393%
415%
433%
427%
423%
416%
Computer 2
Threads
Speed Up
257%
209%
256%
259%
259%
264%
261%
Threads & SIMD
Speed Up
344%
441%
397%
372%
359%
364%
362%

MPI 

There was a lot of data in the MPI test results and as such these have been omitted. I will post a link to the full report when it is completed so that exact figures can be seen. In the mean time I have plotted a graph below with speed differences. The main point to note for the MPI results was that for the small data sizes it resulted in significant slow down, the more computers thrown at the problem the slower it got. Only at the larger data sizes was a performance increase shown but it shows that MPI would be more effective than the other techniques at the largest of data sizes.

Final Comparisons

There is a lot more data in relation to the rest of the results but they can be found in the actual report.

The table below shows the percentage speed up of each different application over the varying data sets (Although useful, the information relating to the AMD system has not been plotted). Here we can see that the fastest version was indeed the GPU version which was several thousand times faster than the sequential version. Even in comparison to the other techniques it is a mile ahead. Despite this, C++ threads and SIMD manage to make a consistent speed up over all of the data sets. The major point of interest in this example is that the MPI version of the application actually performs significantly worse at smaller data sizes than all of the other applications. It does manage to make a speed up at the highest of data sizes and based on this trend I predict it could gain an even better speed up at larger data sizes.

Figure 1 - Graph of plotted percentage speed up
I wrote this in a bit of a rush so if there are any inaccuracies or inconsistencies or if you simply have a question feel drop me a message and I'll fix them up or get back to you ASAP.


Email: markmmiller@hotmail.co.uk
Xbox Live: Dr Death MK 2
Steam: 7thsanctum

Origin: 7thsanctum
Youtube: 7thsanctum
Github: 7thsanctum