Player modeling in TORCS exploiting SVMs and GPUs parallelism

Part 1: Project profile

Project name

Player modeling in TORCS exploiting SVMs and GPUs parallelism

Project short description

This project aims at building a driving agent (bot) for TORCS, using Support Vector Machines (SVMs) as the main machine learning algorithm. The agent must be able to race against other agents with comparable or better performance. To reach this goal, and to be able to train SVMs in a feasible time, the work initially focuses on using the CUDA architecture to rewrite and optimize a well-known SVM toolkit, LibSVM.

Dates

Start date: ~2008/03/01

End date: ~2009/03/01

Website(s)

http://cig.dei.polimi.it

People involved

Project head(s)

P. L. Lanzi - User:PierLucaLanzi

D. Loiacono - User:DanieleLoiacono

Other Politecnico di Milano people

None.

Students currently working on the project

Fabrizio Bonfanti - User:FabrizioBonfanti

Students who worked on the project in the past

None.

External personnel

None.

Laboratory work and risk analysis

This is a software project and all required hardware is accessed remotely, so no laboratory work is expected. No risks are associated with this project.

Part 2: Project description

My CUDA implementation is currently under development and has already gone through several iterations. Read on for a short overview of the most important milestones.

The first implementation

This was the first functional implementation; it supported C-SVC and was tested against a good set of datasets.

Pros:

  • Same results (nu, rho, SVs, bounded SVs, prediction accuracy) as LibSVM

Cons:

  • The Q matrix (#data x #data) was completely precomputed before the actual training (see the sketch after this list)
    • High memory cost
    • High time cost
    • Could only handle problems with fewer than ~7000 vectors
  • Slower than LibSVM on nearly all the training datasets
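
As an illustration of the memory problem, here is a minimal CUDA sketch of what precomputing the full Q matrix looks like, assuming a dense data layout and an RBF kernel (all names and sizes are hypothetical; this is not the project's actual code). For C-SVC, Q[i][j] = y_i * y_j * K(x_i, x_j), so storing the whole matrix needs O(n^2) floats.

// Minimal CUDA sketch of precomputing the full Q matrix for C-SVC with an
// RBF kernel (dense layout, hypothetical names).
// Q[i][j] = y_i * y_j * exp(-gamma * ||x_i - x_j||^2)
#include <cuda_runtime.h>

__global__ void computeFullQ(const float *x,  // n x dim training set, row-major
                             const float *y,  // labels (+1 / -1)
                             float *Q,        // n x n output, row-major
                             int n, int dim, float gamma)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || j >= n) return;

    float dist2 = 0.0f;
    for (int d = 0; d < dim; ++d) {
        float diff = x[i * dim + d] - x[j * dim + d];
        dist2 += diff * diff;
    }
    Q[i * n + j] = y[i] * y[j] * expf(-gamma * dist2);
}

int main()
{
    const int n = 7000, dim = 20;   // illustrative sizes
    const float gamma = 0.5f;
    float *dX, *dY, *dQ;
    cudaMalloc(&dX, n * dim * sizeof(float));
    cudaMalloc(&dY, n * sizeof(float));
    cudaMalloc(&dQ, (size_t)n * n * sizeof(float));  // the O(n^2) buffer
    // (a real run would copy the training set into dX and dY here)

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    computeFullQ<<<grid, block>>>(dX, dY, dQ, n, dim, gamma);
    cudaDeviceSynchronize();

    cudaFree(dX); cudaFree(dY); cudaFree(dQ);
    return 0;
}

With n around 7000 the Q buffer alone takes about 7000 x 7000 x 4 bytes, roughly 190 MB of GPU memory, which is consistent with the cardinality limit listed above.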

The second implementation

Since the results were not as good as I expected, I decided to take another direction, effectively discarding the old work and starting from scratch.

Pros:

  • New algorithm, focused on C-SVC without any of LibSVM's optimizations
  • Nearly the same results (nu, rho, SVs, bounded SVs, prediction accuracy) as LibSVM
  • Rows of Q are now calculated on demand (see the sketch at the end of this section)
    • Frees a lot of memory
  • Can solve problems of nearly any cardinality (limited by the GPU memory; the biggest problem solved correctly had ~60000 vectors)
  • Reached a 3x speedup on a couple of datasets, but still below 1x (slower than LibSVM) on the vast majority
  • Tested on more datasets

Cons:

  • Some kind of Q-row caching is needed, which is not easy to add to this implementation
  • Training time depends heavily on the number of iterations, since rows are calculated on demand
  • It was still not clear why the speedup was so variable
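
A minimal sketch of the on-demand approach, again assuming a dense RBF kernel and hypothetical names: only one row of Q is materialized when the solver's working-set selection asks for it, so GPU memory grows with n * dim instead of n^2, at the price of recomputing rows across iterations (which is why training time depends so heavily on the iteration count).

// Minimal CUDA sketch of computing a single row of Q on demand
// (hypothetical names, RBF kernel, dense layout).
#include <cuda_runtime.h>

// One thread per column j of the requested row i:
// Qrow[j] = y_i * y_j * exp(-gamma * ||x_i - x_j||^2)
__global__ void computeQRow(const float *x, const float *y, float *Qrow,
                            int i, int n, int dim, float gamma)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;

    float dist2 = 0.0f;
    for (int d = 0; d < dim; ++d) {
        float diff = x[i * dim + d] - x[j * dim + d];
        dist2 += diff * diff;
    }
    Qrow[j] = y[i] * y[j] * expf(-gamma * dist2);
}

int main()
{
    const int n = 60000, dim = 20;   // memory now grows as O(n * dim)
    const float gamma = 0.5f;
    float *dX, *dY, *dQrow;
    cudaMalloc(&dX, n * dim * sizeof(float));
    cudaMalloc(&dY, n * sizeof(float));
    cudaMalloc(&dQrow, n * sizeof(float));   // a single row at a time

    // Inside the solver loop the working-set selection (on the CPU) picks an
    // index i, and only then is the corresponding row computed on the GPU.
    int i = 0;   // placeholder for the index chosen by the solver
    computeQRow<<<(n + 255) / 256, 256>>>(dX, dY, dQrow, i, n, dim, gamma);
    cudaDeviceSynchronize();

    cudaFree(dX); cudaFree(dY); cudaFree(dQrow);
    return 0;
}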

The third implementation

My assignment changed and an implementation of epsilon-SVR was required as well. Looking at the math behind the two earlier implementations, I saw that there were not many conceptual differences, so I decided to go back to the first implementation, adding the features that were working in the second one. The speedup is very variable, but this implementation seems to work better on big problems. I suspect that the main source of speedup is the computation of the rows of Q, so higher-cardinality problems gain a lot in this phase.

Pros:

  • epsilon-SVR support added to the C-SVC implementation
  • Same results (nu, rho, SVs, bounded SVs, prediction accuracy) as LibSVM
  • Rows of Q are calculated on demand
    • Frees a lot of memory
  • Rows of Q are now cached
    • Speed gain
  • Can solve problems of nearly any cardinality (limited by the GPU memory; the biggest problem solved correctly had ~60000 vectors)
  • Reached a 4x speedup on a couple of datasets, but still below 1x on the smaller problems
  • Tested on many more datasets

Cons:

  • Some parts of the algorithm are nearly blocking
  • Overlap of data transfers and CUDA kernel executions is not exploited
    • The algorithm itself is very serial
    • Caching forces the lookup phase (CPU) and the computation (GPU, when needed) to be serialized (see the cache sketch after this list)
  • Training time depends heavily on the number of iterations, since rows are calculated on demand
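
A host-side sketch of the kind of Q-row caching described above, loosely inspired by LibSVM's LRU cache and using hypothetical names: the CPU first looks the row up in the cache and only launches the kernel on a miss, which makes the serialization between lookup (CPU) and computation (GPU) explicit.

// Host-side sketch of an LRU cache of Q rows kept in GPU memory
// (hypothetical; reuses the computeQRow kernel from the previous sketch).
#include <cuda_runtime.h>
#include <list>
#include <unordered_map>
#include <utility>

// Kernel from the previous sketch: computes one row of Q on the GPU.
__global__ void computeQRow(const float *x, const float *y, float *Qrow,
                            int i, int n, int dim, float gamma);

class QRowCache {
public:
    QRowCache(int n, size_t maxRows) : n_(n), maxRows_(maxRows) {}

    // Returns a device pointer to row i, computing it on the GPU on a miss.
    float *getRow(int i, const float *dX, const float *dY, int dim, float gamma)
    {
        auto it = rows_.find(i);
        if (it != rows_.end()) {                  // hit: refresh LRU order
            lru_.splice(lru_.begin(), lru_, it->second.second);
            return it->second.first;
        }
        if (rows_.size() >= maxRows_) {           // miss with full cache: evict
            int victim = lru_.back();
            cudaFree(rows_[victim].first);
            rows_.erase(victim);
            lru_.pop_back();
        }
        float *dRow;
        cudaMalloc(&dRow, n_ * sizeof(float));
        computeQRow<<<(n_ + 255) / 256, 256>>>(dX, dY, dRow, i, n_, dim, gamma);
        cudaDeviceSynchronize();                  // CPU waits here: lookup and
                                                  // computation are serialized
        lru_.push_front(i);
        rows_[i] = { dRow, lru_.begin() };
        return dRow;
    }

private:
    int n_;
    size_t maxRows_;
    std::list<int> lru_;   // row indices, most recently used first
    std::unordered_map<int, std::pair<float *, std::list<int>::iterator>> rows_;
};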

Now

I'm currently working on this last implementation. I'm measuring training times and performance on many datasets, and I'm also studying how the SVM parameters affect the overall performance.

I'm also investigating whether training multiple SVMs in parallel is feasible. My guess is that it is possible, but given the serial nature of the algorithm, I do not think the performance gain would be very high.
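
A minimal sketch of what training multiple SVMs in parallel could look like, using one CUDA stream per SVM (hypothetical names; the real solver step is far more complex than the stand-in kernel below). Independent problems can at least overlap their kernel launches and transfers, but each solver remains serial internally, which is consistent with the limited gain expected above.

// Sketch of overlapping work for several independent SVMs with CUDA streams.
#include <cuda_runtime.h>

// Stand-in for one solver update step of a single SVM; it only exists to
// show the stream pattern.
__global__ void dummySolverStep(float *alpha, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) alpha[j] += 0.0f;
}

int main()
{
    const int numSVMs = 4, n = 10000, iterations = 100;
    cudaStream_t streams[numSVMs];
    float *dAlpha[numSVMs];

    for (int s = 0; s < numSVMs; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&dAlpha[s], n * sizeof(float));
        cudaMemsetAsync(dAlpha[s], 0, n * sizeof(float), streams[s]);
    }

    // Each SVM issues its solver steps on its own stream, so short kernels
    // belonging to different problems can overlap on the GPU.
    for (int iter = 0; iter < iterations; ++iter)
        for (int s = 0; s < numSVMs; ++s)
            dummySolverStep<<<(n + 255) / 256, 256, 0, streams[s]>>>(dAlpha[s], n);

    cudaDeviceSynchronize();
    for (int s = 0; s < numSVMs; ++s) {
        cudaFree(dAlpha[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}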

Part 3: Useful links

TORCS - http://torcs.sourceforge.net/

LibSVM - http://www.csie.ntu.edu.tw/~cjlin/libsvm/

CUDA - http://www.nvidia.com/object/cuda_home.html