Tuesday, July 3, 2007

Porting hell

I have been working on a project, porting something from Linux to Solaris and from the kernel to the userland. It was interesting and a learning experience.

I have always had a strong preference for working by myself - solo projects. This one wasn't. I always find working with other engineers a challenge. This time, however, was a relatively positive experience.

But, before I digress too much, let me talk about the port. I expected the kernel to userland part to be the challenge. Linux userland to Solaris userland, hmm, I didn't expect much trouble. I was in for a surprise. :)

There are so many differences, so many that you wonder if there are more differences than similarities. With Linux, POSIX seems irrelevant. ;)

But the most interesting issue was a null dereference panic we were seeing on Solaris; it never happened on Linux. It turned out that a null pointer was being passed to a 'debug print' routine.

The interesting part was that this was not a programming error; it was done intentionally. What I learned was that on Linux, the snprintf() routine, when handed a null pointer for a %s argument, prints "(null)" instead of crashing. On Solaris, the same call dereferences the pointer. The code we were porting assumed the Linux behavior and relied on it all over the place.
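A portable alternative to relying on the C library here is to substitute a placeholder before the pointer ever reaches snprintf(). This is only a minimal sketch: safe_str() and format_debug() are hypothetical names made up for illustration, not routines from the code we ported.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return a printable placeholder for a null pointer,
 * so the output does not depend on how libc treats a null "%s" argument. */
static const char *safe_str(const char *s)
{
    return s != NULL ? s : "(null)";
}

/* Hypothetical debug formatter: writes "name=<value>" into buf and
 * returns whatever snprintf() returns. */
static int format_debug(char *buf, size_t len, const char *value)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "name=%s", safe_str(value));
}
```

Calling format_debug(buf, sizeof buf, NULL) then yields the same text on both systems, at the cost of touching every call site.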

Finally, we came across a solution devised by Sun's engineers. The library /usr/lib/0@0.so.1 can be preloaded (via LD_PRELOAD) to map address 0 as valid memory, so that reading through a null pointer no longer faults. This is a dirty workaround, but effective. ;)

For more on issues porting to Solaris, see this document.

Saturday, April 28, 2007

Linux scaling to 1024 processors?

Hmm... See my earlier post about the giant lock in the Linux kernel.

In spite of that (believe it or not ;) ), folks at SGI claim to have got Linux to scale to 512 and 1024 processors.

How well this really scales, I don't have a clue. But it is very interesting and impressive, all the same, I must say!

-Manoj

Tuesday, April 17, 2007

AMD64 on Pentium D

This is what I saw on a build machine that I use.

-bash-3.00$ uname -a
SunOS sata5 5.11 snv_55b i86pc i386 i86pc
-bash-3.00$ /usr/sbin/prtdiag | grep CPU
Intel(R) Pentium(R) D CPU 2.80GHz LGA775/U1
-bash-3.00$ isainfo -v
64-bit amd64 applications
cx16 mon sse3 sse2 sse fxsr mmx cmov amd_sysc cx8 tsc fpu
32-bit i386 applications
cx16 mon sse3 sse2 sse fxsr mmx cmov sep cx8 tsc fpu
-bash-3.00$


An Intel Pentium processor does amd64 instructions! Ha ha! What has the world come to??

Wednesday, April 4, 2007

Giant lock in Linux

For a brief while, 5 years ago, I used to work on a Windows NT filesystem. More recently, I worked on a Solaris filesystem. And I will be working with Linux kernels in the coming days, or so it seems.

I have read about the old Linux kernels using a giant spin lock for synchronization. And I always thought that applied only to the old 2.0 and 2.2 kernels.

A giant lock is like "one lock to rule the entire kernel". :) Using different locks (synchronization elements) to protect different kernel data structures and critical sections means more granularity and it leads to greater parallelisation. It also means more complexity.

Using the same lock to protect everything means less parallelisation. Such a lock is often termed a giant lock.

Browsing the linux-2.6.9 source with cscope, I could see 527 direct calls to lock_kernel(). There are 39 more calls to reiserfs_write_lock(foobar), which is defined thus:
    #define reiserfs_write_lock( sb ) lock_kernel()
And to prevent recursive spinlock deadlocks, this is how lock_kernel() is defined:
    static inline void lock_kernel(void)
    {
        int depth = current->lock_depth+1;
        if (likely(!depth))
            get_kernel_lock();
        current->lock_depth = depth;
    }
In the 2.4 kernels it apparently used to be
    static __inline__ void lock_kernel(void)
    {
        if (!++current->lock_depth)
            spin_lock(&kernel_flag);
    }
I am very surprised!
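The depth counter in both versions is what makes the giant lock safe to take recursively: only the outermost lock_kernel() actually touches the spinlock. Here is a rough userland sketch of that idea, assuming a C11 compiler. The names struct task, lock_kernel_sketch() and unlock_kernel_sketch() are made up for illustration, and the atomic_flag is only a stand-in for the kernel's spinlock; this is not kernel code.

```c
#include <assert.h>
#include <stdatomic.h>

/* Stand-in for the giant lock: one flag shared by everything. */
static atomic_flag kernel_flag = ATOMIC_FLAG_INIT;

/* Stand-in for the per-task state; lock_depth is -1 when the task
 * does not hold the lock, as in the kernel code quoted above. */
struct task {
    int lock_depth;
};

/* Only the outermost call (depth goes from -1 to 0) spins on the flag;
 * nested calls by the same task just bump the counter. */
static void lock_kernel_sketch(struct task *cur)
{
    if (++cur->lock_depth == 0) {
        while (atomic_flag_test_and_set_explicit(&kernel_flag,
                                                 memory_order_acquire))
            ; /* spin until the current holder releases the flag */
    }
}

/* Only the outermost unlock (depth drops back to -1) releases the flag. */
static void unlock_kernel_sketch(struct task *cur)
{
    assert(cur->lock_depth >= 0); /* caller must hold the lock */
    if (--cur->lock_depth < 0)
        atomic_flag_clear_explicit(&kernel_flag, memory_order_release);
}
```

A task that starts with lock_depth = -1 can call lock_kernel_sketch() twice without deadlocking on itself; the flag is released only after the second matching unlock.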

I am tempted to quote what someone thought about the Linux kernel on NTFSD. But I won't. I'll be nice. ;)

-Manoj