jamesh wrote: Do you think that the possible improvements are wholly cumulative (should that be multiplicative?)? So 3 * 3 * 2 = 18 times better performance? Avoiding GPU-specific stuff would be the best option, and a speedup of that proportion would really mean the GPU work may not be necessary (unless it's multiplatform code, e.g. OpenGL ES)
To explain will take a bit of a wall of text; I'll try not to slip into otiosity or insulting oversimplicity.
Performance is a horribly difficult thing to measure. We all know that and yet we all want a single number of goodness. Look at the terrible things that has done to education policy in the UK, just as one example. My guesstimates above are based on a lot of experience but not on any measurements so far since the work has not been done; much salt ingestion should be anticipated.
Compute performance in our context means how fast code runs for numeric, list-processing, or text-handling type jobs. Graphics performance means moving pixels around, whether on the glass or not. Application performance is how well our application seems to run. System performance is how well the system as a whole runs, and it is one of the slipperiest things to get a grip on.
Example - Eliot is developing the new memory system and the net performance depends drastically on whether he has Chrome running on his macbook; with a number of tabs open it roughly halves the apparent performance of his Squeak system. If it were possible to make the Smalltalk code run infinitely fast it would still leave the graphics. If we made graphics infinitely fast it would still leave the system overhead.
Improving Scratch has attacked the performance on five levels so far:
- graphics; initially there were claims that graphics performance was the limit for Scratch on the Pi, and I was able to fairly quickly disprove that. People look at a 'simple' script that merely rotates or flashes a sprite and don't notice that the script is being animated to some degree (depending on settings), the sprite icon is being animated, the sprites in the library are animating, and so on. There's a lot going on. Where graphics was getting heavily used we were able (big kudos to Ben Avison) to make use of Squeak's flexibility to extend the graphics engine to use ARMv6-specific code and massively speed up some important operations. Being able to make use of the GPU would allow us to do a lot of interesting things much faster - anti-aliased fonts etc. come to mind. Building a vector-drawn UI instead of a bitblitted one would be nice too.
- execution engine; Scratch is built on an execution engine that kinda-sorta mimics a very parallel machine with a kinda-sorta stack oriented 'cpu'. So that is written in Smalltalk and interprets the Scratch script blocks fairly naively with much converting of strings to numbers and back again, and a large amount of code that didn't take advantage of Smalltalk's strengths. I fixed a *lot* of that, though there is certainly a lot more that could be done, not excluding my ideas for pseudo-compiling the scripts to real Smalltalk bytecodes.
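To illustrate the kind of overhead described above - this is a hypothetical sketch in Python, not Scratch's actual Smalltalk code - compare a block interpreter that keeps every operand as a string and converts on each step against one that keeps real numbers. All the names and the tiny "block" format here are invented for illustration:

```python
def run_stringly(blocks):
    """Naive style: operands live as strings, converted on every operation."""
    acc = "0"
    for op, arg in blocks:
        if op == "add":
            acc = str(float(acc) + float(arg))   # string -> number -> string, every step
        elif op == "mul":
            acc = str(float(acc) * float(arg))
    return float(acc)

def run_numeric(blocks):
    """Improved style: convert each operand once, keep numbers throughout."""
    acc = 0.0
    for op, arg in blocks:
        n = float(arg)
        if op == "add":
            acc += n
        elif op == "mul":
            acc *= n
    return acc

# A toy 'script': ((0 + 2) * 10) + 1.5
script = [("add", "2"), ("mul", "10"), ("add", "1.5")]
assert run_stringly(script) == run_numeric(script) == 21.5
```

Both give the same answer; the difference is that the first pays a parse-and-print cost on every single block execution, which is exactly the sort of thing that adds up inside a tight interpreter loop.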
- image upgrade; 'image' here refers not to a picture but the Smalltalk image file that is the object state file you run with the virtual machine. The original version of Scratch was written a long time ago, in the days of Squeak 2.8 or so; things have moved on a fair bit since then and the old image file could not run on the latest virtual machines. So a major bit of work has been porting the code forward, and it has been a lot more work than I could have imagined up front. A lot of very deep code changed in ways that took a lot of time to sort out, and it really was an object (hah! pun!) lesson in how important really thorough documentation is. Fixing something when there is almost no information about what 'working' means is tricky. The newer image has proper closures, better networking code, cleaner and more flexible handling of i18n, better code management tools etc., so it is a big win for the long term.
- virtual machine; with the new image in place we were able to firmly move to the StackVM, which improves compute performance by roughly 50%. The more advanced CogVM, which does dynamic translation of Smalltalk bytecodes to ARM machine code, ought to be about three times faster. It will be interesting to see how that works out since this is new work; we (Eliot & I in this case) have a bit of experience with Smalltalk, translation and even ARM (hmm, 25+ years actually), and Smalltalk has been doing dynamic translation since 1984 (see the seminal paper by our old colleagues Peter Deutsch & Allan Schiffman, who invented the idea. No, Java did not invent it. Java didn't invent anything except new ways to drive people insane.) The original system made things maybe 10x faster, but you have to consider what the predecessor was. The next version was HPS (High Performance Smalltalk), and that formed the basis of the still-current product called VisualWorks, on more CPU architectures & OSes than most people could list. Squeak is/was a fairly clever but still simple interpreter and has got cleverer over the years; the CogVM is 'competing' with a higher starting point, hence the estimate of a 3x speedup.

The improvement we're anticipating from the new memory manager system *should* appear on the Pi just as on the x86, but remember that the x86 has a humungous set of caches and a much, much faster memory bus; the differences may make the new system behave very differently. We'll find out. 'Sista' is an adaptive Smalltalk-level optimiser that *should* improve things everywhere Smalltalk code is executing, but we'll see. The really cool thing is that it is done in Smalltalk, in a live system. Metaprogramming is so powerful.
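The core idea behind dynamic translation can be sketched in a few lines - again hypothetically, in Python rather than in the CogVM, with the bytecode names and caching scheme invented purely for illustration. The first time a method runs, its bytecodes are translated into a single host-level function and cached; later calls skip bytecode dispatch entirely:

```python
translated_cache = {}

def interpret(bytecodes, x):
    """Plain interpreter: dispatch on every bytecode, on every call."""
    for op, arg in bytecodes:
        if op == "add":
            x += arg
        elif op == "mul":
            x *= arg
    return x

def translate(bytecodes):
    """'Compile' the bytecodes once into a single host function."""
    steps = []
    for op, arg in bytecodes:
        if op == "add":
            steps.append(lambda x, a=arg: x + a)
        elif op == "mul":
            steps.append(lambda x, a=arg: x * a)
    def native(x):
        for step in steps:
            x = step(x)
        return x
    return native

def run(method_id, bytecodes, x):
    fn = translated_cache.get(method_id)
    if fn is None:                 # first execution: translate, then cache
        fn = translated_cache[method_id] = translate(bytecodes)
    return fn(x)                   # later calls reuse the translation

code = [("add", 3), ("mul", 4)]    # (x + 3) * 4
assert interpret(code, 1) == run("m1", code, 1) == 16
```

In a real dynamic translator the "native" function is actual machine code and the dispatch saving is far larger, but the shape of the trick - translate once, cache, reuse - is the same one Deutsch & Schiffman described.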
- application code; a lot of places can be improved simply by use of normal software engineering - make the damn code better! The improved editing speed and the faster startup I sorted out last week are down to just 'getting it right'.
Whoops, and I promised to try to avoid prolixity.
jamesh wrote: What you say about the memory subsystem is interesting, as I have it on good authority it's one of the best ARM SoC ones around (the Apple one is better)
AIUI the SoC on the Pi only 'accidentally' has an ARM and we're lucky it's there at all. SoCs are designed for fairly specific markets and price points, and that will affect the speed and width of the bus amongst so many other things. Apple spent a colossal amount of money to develop their SoC and it is some of the cleverest stuff I have ever seen. They also have around 25 years' experience in making ARM systems to build on. And their selling price is a bit over $35....