Tight Loop Examples

Yielding in Charcoal is quite cheap, but it's not free. For the sake of back-of-the-envelope performance estimates, you can assume a yield that immediately returns back to the current activity costs maybe a dozen instructions. In most code this overhead shouldn't be worth thinking about, but in tight inner loops it can be painful. For example:

1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
  
  
  
  
  
  
  
  
  
  
float dot_product( size_t N, float *A, float *B )
{
    
float rv = 0.0;
    
size_t i;
    
for( i = 0; i < N; ++i )
    
{
        
rv += A[i] * b[i];
    
}
    
return rv;
}

The inner loop here is probably well under a dozen instructions, so adding a yield on every iteration (which happens by default in Charcoal) more than doubles the run time. If someone cares about the performance of this code at all, that's probably unacceptable. The simplest "fix" for this problem is to use the for loop variant that does not yield by default.

1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
  
  
  
  
  
  
  
  
  
  
float dot_product( size_t N, float *A, float *B )
{
    
float rv = 0.0;
    
size_t i;
    
for_no_yield( i = 0; i < N; ++i )
    
{
        
rv += A[i] * b[i];
    
}
    
return rv;
}

This version avoids yielding too frequently, but if it is ever called with a large input, it may not yield frequently enough. Well-behaved Charcoal programs should go no more than a few milliseconds between yields.

Hitting a sweet spot in terms of yield frequency, while adding zero overhead to the inner loop requires somewhat more complex code.

1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
  
#define BLOCK_SIZE 32

float dot_product( size_t N, float *A, float *B )
{
    
float rv = 0.0;
    
size_t i = 0, j;
    
for( j = BLOCK_SIZE; j < N; j += BLOCK_SIZE )
    
{
        
for_no_yield( ; i < j; ++i )
        
{
            
rv += A[i] * b[i];
        
}
    
}
    
for_no_yield( ; i < N; ++i )
    
{
        
rv += A[i] * b[i];
    
}
    
return rv;
}

[Some explanatory text here]

(Bug hunter bonus points for anyone who noticed there is a (probably extremely rare) overflow bug when N > SIZE_MAX - BLOCK_SIZE. This shouldn't be a complex bug to fix.)

Things get a little trickier if the loop condition is data-dependent.

1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
  
  
  
  
  
  
  
  
  
  
#define BLOCK_SIZE 32

char *mystrcpy( char *dest, const char *src )
{
    
char *rv = dst;
    
while_no_yield( *dst++ = *src++ )
        
if( 0 == dst % BLOCK_SIZE )
            
yield;
    
return rv;
}

This example is just slightly less satisfying than the previous one performance-wise, because we had to add an extra branch in the inner loop. There is no easy way around this, because termination of the loop might happen at any time. On the other hand, high performance implementations of strcpy are usually quite architecture-dependent, so it might be possible to sneak the periodic yield in to one of those with no performance ill effects.

This example also nicely illustrates that the X_no_yield statements (where X could be for, while, goto, ...) are distinct from the unyielding keyword. The "no yield" variants simply tell the compiler not to insert the implicit yields that it would for the "normal" variants. The while_no_yield on line 6 above does not interfere with the explcit yield on line 8.

Conclusion

Most Charcoal code shouldn't need tricks like this. Most applications have only a few performance-critical inner loops (if they are CPU-bound at all). For those tight inner loops, Charcoal does impose a little code complexity penalty to make sure yields happen at a good rate. However, the penalty is not terribly high. The story is a little more complex for library code, and that is addressed in a different example.


Creative Commons License
Charcoal Programming Language by Benjamin Ylvisaker is licensed under a Creative Commons Attribution 4.0 International License