beta scratch performance

ghp · Fri Jun 13, 2014 2:26 pm

Hello,

the last days, I executed some performance tests for nuscratch-beta, comparing it with current scratch and a win installation.
See the results in http://heppg.de/ikg/wordpress/?p=270

: performance_summary.png (14.47 KiB) Viewed 6893 times

Regards,
Gerhard

simplesi · Fri Jun 13, 2014 4:54 pm

What CPU speed was the Win test conducted on?

Simon

ghp · Fri Jun 13, 2014 5:08 pm

Hello,
the details are on the referenced page. It is an intel i3 2.2GHz processor. RPi runs with 1GHz.
Regards,
Gerhard

simplesi · Fri Jun 13, 2014 9:44 pm

I personally think that editing speedup is much more important than script running speed.

Although it would be nice to approach PC running speed, its the delays with editing that put youngsters off (with their attendant short attention spans).

As long as this speed is similar to 1.4/2.0 on PCs then it will be acceptable

Simon

timrowledge · Sat Jun 14, 2014 4:44 am

Well *I'm* impressed.

Sat Jun 14, 2014 9:44 am

simplesi wrote:I personally think that editing speedup is much more important than script running speed.

Although it would be nice to approach PC running speed, its the delays with editing that put youngsters off (with their attendant short attention spans).

As long as this speed is similar to 1.4/2.0 on PCs then it will be acceptable

Simon

How to you propose making a single core 700Mhz with mobile phone level graphics machine run as fast as a multicore 2-3Ghz machine with a desktop graphics card (in the general case, not specific benchmarks)? You have to remember that there are HW limitations that even heavily optimised software cannot get round.

I'm stunned Scratch runs as fast as it does!

teh_orph · Sat Jun 14, 2014 9:45 am

I'm impressed. In those last two results, is the pi really faster than PC? What's going on there?

simplesi · Sat Jun 14, 2014 10:15 am

@james

I'm saying the speed improvements are primarily needed in the editing/block moving part of Scratch

The fact that it won't run graphics as fast as a 3GHz isn't as big an issue.

Young Scratch attention span is

Lots of them use Scratch on and if creating projects on a Pi is too slow then some will give up

Give me some credit - this IS my area of expertise

Simon

Sat Jun 14, 2014 10:22 am

I believe you said "As long as this speed is similar to 1.4/2.0 on PCs then it will be acceptable ".

Are you really saying that after all the work that people have put in making Scratch faster, it still hasn't reached the level of 'acceptable'?

That may be impossible, for the reasons stated above. An modern desktop PC is about 10 times faster than a Raspberry Pi.

simplesi · Sat Jun 14, 2014 10:43 am

Is this going to be a 5min or the full half-hour one today

"acceptable" is a wooly word

I "accept" it as I work with it daily but others don't "accept" it as they switch between PC and Pi and go - wow - that's slow

We have all been "putting up" with performance as the upside (Scratch controlled Robots etc) is so huge.

I'm confident that NuScratch will provide "acceptable" editing performance but its not complete and currently has some big performance issues - main one being loading time.

I know you have a very low opinion of me - but I do have some game in this area

Simon

Sat Jun 14, 2014 1:16 pm

simplesi wrote:Is this going to be a 5min or the full half-hour one today

"acceptable" is a wooly word

I "accept" it as I work with it daily but others don't "accept" it as they switch between PC and Pi and go - wow - that's slow

We have all been "putting up" with performance as the upside (Scratch controlled Robots etc) is so huge.

I'm confident that NuScratch will provide "acceptable" editing performance but its not complete and currently has some big performance issues - main one being loading time.

I know you have a very low opinion of me - but I do have some game in this area

Simon

You said that the performance, even now, of Scratch, isn't acceptable. My point is, it may not get any/much better, it may not be possible to make it any faster, because the Raspi simply isn't fast enough to compete with a 3Gig desktop machine. That is clearly the case. The Raspi is a 700Mhz ARM, equivalent to a ten year old P3 300-400. I think your expectations may simply be too high. There is a law of diminishing returns with regard to optimisation. You grab the low hanging fruit, this is done. Then you start on the harder stuff. This is being done here with nuscratch. But as you go in to each new area of optimisation, the returns get fewer and fewer, you spent more time getting less extra performance. Look on it as exponential decay. There may be bumps of course - someone might come up with a neat idea to increase speed, that takes little effort - but tbh, I would think all that stuff has already been done. But at this stage, all the big improvements are likely to have been done already.

Loading may be an area that can be improved of course, but you may well be hitting either SD card or CPU bandwidth limitations, or both.

Sat Jun 14, 2014 1:18 pm

ghp wrote:Hello,

the last days, I executed some performance tests for nuscratch-beta, comparing it with current scratch and a win installation.
See the results in http://heppg.de/ikg/wordpress/?p=270
performance_summary.png

Regards,
Gerhard

Would you be able to repeat the test with the HW cursor implementation for X. I have a feeling that it won't help a huge amount, but I would like to know if it does help as it's one of the reasons I did it.

Code is here, a modification of the X turbo driver currently used on the Raspi. https://github.com/JamesH65/xf86-video-fbturbo

simplesi · Sat Jun 14, 2014 2:40 pm

So you've elected for the full half-hour then...

...but on the advice of legal council I'll just stop.

Simon

timrowledge · Sat Jun 14, 2014 5:37 pm

jamesh wrote:You said that the performance, even now, of Scratch, isn't acceptable. My point is, it may not get any/much better, it may not be possible to make it any faster, because the Raspi simply isn't fast enough to compete with a 3Gig desktop machine. That is clearly the case. The Raspi is a 700Mhz ARM, equivalent to a ten year old P3 300-400. I think your expectations may simply be too high. There is a law of diminishing returns with regard to optimisation. You grab the low hanging fruit, this is done. Then you start on the harder stuff. This is being done here with nuscratch. But as you go in to each new area of optimisation, the returns get fewer and fewer, you spent more time getting less extra performance. Look on it as exponential decay. There may be bumps of course - someone might come up with a neat idea to increase speed, that takes little effort - but tbh, I would think all that stuff has already been done. But at this stage, all the big improvements are likely to have been done already.

Speaking as the person doing the work I can assure you all that nowhere near all the possible improvements have been done. The Cog VM is still to arrive - that should roughly triple compute performance. The Spur memory manager looks like it might near double that. Sista is showing promise to double that. The Scratch code itself still has a lot of possible improvements. I want to see if we can make use of the gpu to really hit on graphics performance. I have some ideas on translating scratch scripts directly into bytecodes that could make compute performance dozens of times faster.

Interestingly, all these improvements (except the pi specific gpu stuff) will benefit all platforms, so yes the Pi will always be slower than a fast Mac running the equivalent vm/image. But that's not a sensible (moving) goalpost to shoot at (see, even a soccer hating nerd can make contemporaneous sport metaphors) and my target is the 2.4GHz dual i7 iMac I had at the start of my work on this. That ran roughly 15 times faster than the original Pi Scratch release. I think I can beat it, but there are certainly boundaries imposed by the Pi hardware, particularly the memory tram. (It's too slow to be a bus) It'll be interesting to see how it all goes. As long as I'm funded, I'll keep making it better.

simplesi · Sat Jun 14, 2014 6:39 pm

ghp · Sun Jun 15, 2014 8:06 am

Hello,

I have installed the ssvb/xf86-video-fbturbo and re-executed the graphic oriented tests. I added two move tests, which I usually explain to the kids in school as a scratch-antipattern: while true; goto x,y; inc x; inc y; endwhile; This works, but movement speed is limited by cpu-usage. The second is movements with variable display on stage. This slows down execution speed drastically in RPi-1.4-scratch.

: performance_summary_2.png (19.52 KiB) Viewed 6555 times

The chart is less intuitive to read now. The bar colors are blue, dark yellow for raspbian system, and light blue, light yellow for the modified driver. No bar means: test not executed.

Improvements are noticeable with factor 0.8 for both scratch and beta.
As the graphics shows relative performance to RPi-scratch-1.4, the differences look smaller for beta, as the bars in pixel count are lower.

The move2-sample is executed in both presentation mode and full stage, as the absolute differences are larger than in other samples (which have always been reported from presentation mode).
The absolute script execution timings are (in sec).

: move2_absolute.png (6.21 KiB) Viewed 6555 times

For this test, the script runs in less than 10% using beta and modified X.

Regards,
Gerhard

Sun Jun 15, 2014 4:10 pm

timrowledge wrote:
jamesh wrote:You said that the performance, even now, of Scratch, isn't acceptable. My point is, it may not get any/much better, it may not be possible to make it any faster, because the Raspi simply isn't fast enough to compete with a 3Gig desktop machine. That is clearly the case. The Raspi is a 700Mhz ARM, equivalent to a ten year old P3 300-400. I think your expectations may simply be too high. There is a law of diminishing returns with regard to optimisation. You grab the low hanging fruit, this is done. Then you start on the harder stuff. This is being done here with nuscratch. But as you go in to each new area of optimisation, the returns get fewer and fewer, you spent more time getting less extra performance. Look on it as exponential decay. There may be bumps of course - someone might come up with a neat idea to increase speed, that takes little effort - but tbh, I would think all that stuff has already been done. But at this stage, all the big improvements are likely to have been done already.
Speaking as the person doing the work I can assure you all that nowhere near all the possible improvements have been done. The Cog VM is still to arrive - that should roughly triple compute performance. The Spur memory manager looks like it might near double that. Sista is showing promise to double that. The Scratch code itself still has a lot of possible improvements. I want to see if we can make use of the gpu to really hit on graphics performance. I have some ideas on translating scratch scripts directly into bytecodes that could make compute performance dozens of times faster.

Interestingly, all these improvements (except the pi specific gpu stuff) will benefit all platforms, so yes the Pi will always be slower than a fast Mac running the equivalent vm/image. But that's not a sensible (moving) goalpost to shoot at (see, even a soccer hating nerd can make contemporaneous sport metaphors) and my target is the 2.4GHz dual i7 iMac I had at the start of my work on this. That ran roughly 15 times faster than the original Pi Scratch release. I think I can beat it, but there are certainly boundaries imposed by the Pi hardware, particularly the memory tram. (It's too slow to be a bus) It'll be interesting to see how it all goes. As long as I'm funded, I'll keep making it better.

That's really good to know. Useful to know what the expectations are. Do you think that the possible improvements are wholly cumulative (should that be multiplicative?)? So 3 * 3 * 2 = 18 times better performance? Avoiding GPU specific stuff would be the best option, and a speed up of that proportion would really mean the GPU work may not be necessary (unless it's multiplatform code e.,g.OpenGLES)

Don't fancy the workload though! Just goes to show how bloated code has got over the last few years. Lots of lazy programmers out there!

What you say about the memory subsystem is interesting, as I have it on good authority it's one of the best ARM SoC ones around (The Apple one is better)

Sun Jun 15, 2014 4:11 pm

ghp wrote:Hello,

I have installed the ssvb/xf86-video-fbturbo and re-executed the graphic oriented tests. I added two move tests, which I usually explain to the kids in school as a scratch-antipattern: while true; goto x,y; inc x; inc y; endwhile; This works, but movement speed is limited by cpu-usage. The second is movements with variable display on stage. This slows down execution speed drastically in RPi-1.4-scratch.
performance_summary_2.png
The chart is less intuitive to read now. The bar colors are blue, dark yellow for raspbian system, and light blue, light yellow for the modified driver. No bar means: test not executed.

Improvements are noticeable with factor 0.8 for both scratch and beta.
As the graphics shows relative performance to RPi-scratch-1.4, the differences look smaller for beta, as the bars in pixel count are lower.

The move2-sample is executed in both presentation mode and full stage, as the absolute differences are larger than in other samples (which have always been reported from presentation mode).
The absolute script execution timings are (in sec).
move2_absolute.png
For this test, the script runs in less than 10% using beta and modified X.

Regards,
Gerhard

Hi Gerhard,

can I confirm you used my fork of ssvb's code, rather than his version (since I haven't pushed my HW cursor code to his repo yet).

If so, that 0.8 gain is not too bad and well worth pushing.

James

ghp · Sun Jun 15, 2014 6:22 pm

Hello,

I used the code from https://github.com/JamesH65/xf86-video-fbturbo.

Will this go into the repo soon ? Or other way round: when it is assumed 'production ready', I would bring it to the systems for my school workshop. As even with current scratch, performance is better.

Regards,
Gerhard

Sun Jun 15, 2014 7:37 pm

Thanks for doing the testing. I've been pretty much sitting on it waiting for any issue reports but I think most people have missed it.It's good to know there is a performance improvement.

So if anyone else wants to test it, any results would be greatly appreciated. I'll send in a pull request to the main repo in a week or so if nothing untoward turns up.

timrowledge · Sun Jun 15, 2014 10:19 pm

jamesh wrote: Do you think that the possible improvements are wholly cumulative (should that be multiplicative?)? So 3 * 3 * 2 = 18 times better performance? Avoiding GPU specific stuff would be the best option, and a speed up of that proportion would really mean the GPU work may not be necessary (unless it's multiplatform code e.,g.OpenGLES)

To explain will take a bit of a wall of text; I'll try not to slip into otiosity nor insulting oversimplicity.

Performance is a horribly difficult thing to measure. We all know that and yet we all want a single number of goodness. Look at the terrible things that has done to education policy in the UK, just as one example. My guesstimates above are based on a lot of experience but not on any measurements so far since the work has not been done; much salt ingestion should be anticipated.

Compute performance in our context means how fast code that runs numeric or list processing or text handling type jobs. Graphics performance means moving pixels around whether on the glass or not. Application performance is how well our application seems to run. System performance is how well the system as a whole runs and it one of the slipperiest things to get a grip on.
Example - Eliot is developing the new memory system and the net performance depends drastically on whether he has Chrome running on his macbook; with a number of tabs open it roughly halves the apparent performance of his Squeak system. If it were possible to make the Smalltalk code run infinitely fast it would still leave the graphics. If we made graphics infinitely fast it would still leave the system overhead.

Improving Scratch has attacked the performance on five levels so far

-graphics; initially there were claims that graphics performance was the limit for Scratch on the Pi and I was able to fairly quickly disprove that. People look at a 'simple' script that merely rotates or flashes a sprite and don't notice that the script is being animated to some degree (depending on settings), the sprite icon is being animated, the sprites in the library are animating and so on. There's a lot going on. Where graphics was getting heavily used we were able (big kudos to Ben Avison) to make use of Squeak's flexibility to extend the graphics engine to use ARMv6 specific code and massively speed up some important operations. Being able to make use of the GPU would allow us to do a lot of interesting things much faster - antialiased fonts etc come to mind. Building a vector drawn UI instead of a bitblitted one would be nice too.

- execution engine; Scratch is built on an execution engine that kinda-sorta mimics a very parallel machine with a kinda-sorta stack oriented 'cpu'. So that is written in Smalltalk and interprets the Scratch script blocks fairly naively with much converting of strings to numbers and back again, and a large amount of code that didn't take advantage of Smalltalk's strengths. I fixed a *lot* of that, though there is certainly a lot more that could be done, not excluding my ideas for pseudo-compiling the scripts to real Smalltalk bytecodes.

- image upgrade; 'image' here refers not to a picture but the Smalltalk image file that is the object state file you run with the virtual machine. The original version of Scratch was written a long time ago in the days of Squeak 2.8 or so; things have moved on a fair bit since then and the old image file could not run on the latest virtual machines. So a major bit of work has been porting the code forward and it has been a lot more work than I could have imagined up front. A lot of very deep code changed in ways that took a lot of time to sort out and it really was an object (hah! pun!) lesson in how important really thorough documentation is. Fixing something when there is almost no information about 'working' means is tricky. The newer image has proper closures, better networking code, cleaner and more flexible handling of i18n, better code management tools etc so it is a big win for the long term.

- virtual machine; with the new image in place we were able to firmly move to the StackVM which improves compute performance by roughly 50%. The more advance CogVM that does dynamic translation of Smalltalk byte codes to ARM machine code ought to be about three times faster. It will be interesting to see how that works out since this is new work; we (Eliot & I in this case) have a bit of experience with Smalltalk, translation and even ARM (hmm, 25+ years actually) and Smalltalk has been doing dynamic translation since 1984 ( See the seminal paper by our old colleagues Peter Deutsch & Allan Schiffman who invented the idea. No, java did not invent it. Java didn't invent anything except new ways to drive people insane) The original system made things maybe 10x faster but you have to consider what the predecessor was. The next version was HPS (High Performance Smalltalk) and that formed the basis of the still current product called VisualWorks on more cpu architectures & OS's than most people could list. Squeak is/was a fairly clever but still simple interpreter and has got cleverer over the years; the CogVM is 'competing' with a higher starting point, hence the estimates of a 3x speedup. The improvement we're anticipating for the new memory manager system *should* appear on the Pi just as on the x86 but remember that the x86 has a humungous set of caches and a much much faster memory bus. The differences may make the new system very different. We'll find out. 'Sista' is an adaptive Smalltalk level optimiser that *should* improve things everywhere Smalltalk code is executing but we'll see. The really cool thing is that it is done in Smalltalk in a live system. Metaprogramming is so powerful.

- application code; a lot of places can be improved simply by use of normal software engineering - make the damn code better! The improved editing speed and the faster startup I sorted out last week are down to just 'getting it right'.

Whoops, and I promised to try to avoid prolixity.

jamesh wrote:What you say about the memory subsystem is interesting, as I have it on good authority it's one of the best ARM SoC ones around (The Apple one is better)

AIUI the SoC on the Pi only 'accidentally' has an ARM and we're lucky it's there at all. SoCs are designed for fairly specific markets and price points and that will affect the speed and breadth of the bus amongst so many other things. Apple spent a colossal amount of money to develop their SoC and it is some of the cleverest stuff I have ever seen. They also have around 25 years experience in making ARM systems to build on. And their selling price is a bit over $35....

Mon Jun 16, 2014 7:25 am

timrowledge wrote:
jamesh wrote: Do you think that the possible improvements are wholly cumulative (should that be multiplicative?)? So 3 * 3 * 2 = 18 times better performance? Avoiding GPU specific stuff would be the best option, and a speed up of that proportion would really mean the GPU work may not be necessary (unless it's multiplatform code e.,g.OpenGLES)
To explain will take a bit of a wall of text; I'll try not to slip into otiosity nor insulting oversimplicity.

Performance is a horribly difficult thing to measure. We all know that and yet we all want a single number of goodness. Look at the terrible things that has done to education policy in the UK, just as one example. My guesstimates above are based on a lot of experience but not on any measurements so far since the work has not been done; much salt ingestion should be anticipated.

Compute performance in our context means how fast code that runs numeric or list processing or text handling type jobs. Graphics performance means moving pixels around whether on the glass or not. Application performance is how well our application seems to run. System performance is how well the system as a whole runs and it one of the slipperiest things to get a grip on.
Example - Eliot is developing the new memory system and the net performance depends drastically on whether he has Chrome running on his macbook; with a number of tabs open it roughly halves the apparent performance of his Squeak system. If it were possible to make the Smalltalk code run infinitely fast it would still leave the graphics. If we made graphics infinitely fast it would still leave the system overhead.

Improving Scratch has attacked the performance on five levels so far

-graphics; initially there were claims that graphics performance was the limit for Scratch on the Pi and I was able to fairly quickly disprove that. People look at a 'simple' script that merely rotates or flashes a sprite and don't notice that the script is being animated to some degree (depending on settings), the sprite icon is being animated, the sprites in the library are animating and so on. There's a lot going on. Where graphics was getting heavily used we were able (big kudos to Ben Avison) to make use of Squeak's flexibility to extend the graphics engine to use ARMv6 specific code and massively speed up some important operations. Being able to make use of the GPU would allow us to do a lot of interesting things much faster - antialiased fonts etc come to mind. Building a vector drawn UI instead of a bitblitted one would be nice too.

- execution engine; Scratch is built on an execution engine that kinda-sorta mimics a very parallel machine with a kinda-sorta stack oriented 'cpu'. So that is written in Smalltalk and interprets the Scratch script blocks fairly naively with much converting of strings to numbers and back again, and a large amount of code that didn't take advantage of Smalltalk's strengths. I fixed a *lot* of that, though there is certainly a lot more that could be done, not excluding my ideas for pseudo-compiling the scripts to real Smalltalk bytecodes.

- image upgrade; 'image' here refers not to a picture but the Smalltalk image file that is the object state file you run with the virtual machine. The original version of Scratch was written a long time ago in the days of Squeak 2.8 or so; things have moved on a fair bit since then and the old image file could not run on the latest virtual machines. So a major bit of work has been porting the code forward and it has been a lot more work than I could have imagined up front. A lot of very deep code changed in ways that took a lot of time to sort out and it really was an object (hah! pun!) lesson in how important really thorough documentation is. Fixing something when there is almost no information about 'working' means is tricky. The newer image has proper closures, better networking code, cleaner and more flexible handling of i18n, better code management tools etc so it is a big win for the long term.

- virtual machine; with the new image in place we were able to firmly move to the StackVM which improves compute performance by roughly 50%. The more advance CogVM that does dynamic translation of Smalltalk byte codes to ARM machine code ought to be about three times faster. It will be interesting to see how that works out since this is new work; we (Eliot & I in this case) have a bit of experience with Smalltalk, translation and even ARM (hmm, 25+ years actually) and Smalltalk has been doing dynamic translation since 1984 ( See the seminal paper by our old colleagues Peter Deutsch & Allan Schiffman who invented the idea. No, java did not invent it. Java didn't invent anything except new ways to drive people insane) The original system made things maybe 10x faster but you have to consider what the predecessor was. The next version was HPS (High Performance Smalltalk) and that formed the basis of the still current product called VisualWorks on more cpu architectures & OS's than most people could list. Squeak is/was a fairly clever but still simple interpreter and has got cleverer over the years; the CogVM is 'competing' with a higher starting point, hence the estimates of a 3x speedup. The improvement we're anticipating for the new memory manager system *should* appear on the Pi just as on the x86 but remember that the x86 has a humungous set of caches and a much much faster memory bus. The differences may make the new system very different. We'll find out. 'Sista' is an adaptive Smalltalk level optimiser that *should* improve things everywhere Smalltalk code is executing but we'll see. The really cool thing is that it is done in Smalltalk in a live system. Metaprogramming is so powerful.

- application code; a lot of places can be improved simply by use of normal software engineering - make the damn code better! The improved editing speed and the faster startup I sorted out last week are down to just 'getting it right'.

Whoops, and I promised to try to avoid prolixity.

jamesh wrote:What you say about the memory subsystem is interesting, as I have it on good authority it's one of the best ARM SoC ones around (The Apple one is better)
AIUI the SoC on the Pi only 'accidentally' has an ARM and we're lucky it's there at all. SoCs are designed for fairly specific markets and price points and that will affect the speed and breadth of the bus amongst so many other things. Apple spent a colossal amount of money to develop their SoC and it is some of the cleverest stuff I have ever seen. They also have around 25 years experience in making ARM systems to build on. And their selling price is a bit over $35....

Thanks Tim, very useful data there.

With regard to the SoC memory subsystem, it's actually a little more sophisticated that just slapping a ARM on to a Videocore4! (Although that is actually pretty much the end result). The guy who wrote the memory controller RTL is actually very very good at it, and did a really good job. Compared with devices from people like AllWinner and the like, it's pretty fast. At the time of Raspi launch, IIRC, the only faster one we knew about was the Apple one, which for the reasons you give, is indeed better. I don't have the figures on me, but I'm sure I could extract them from Gert at some point, but they may be company confidential.

beta scratch performance

beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance

Re: beta scratch performance