<http://www.tldp.org/HOWTO/KernelAnalysis-HOWTO-4.html>


    Kernel


OS (Operating System)
Linux - XWindow
Windows (Win9x, WinME, WinNT, Win2K, WinXP)


    User Mode and Kernel Mode




      Operative modes


   1. Kernel Mode: the machine operates on critical data structures,
      with direct access to hardware (IN/OUT or memory mapped),
      memory, IRQs, DMA, and so on.
   2. User Mode: users can run applications.

                      
               |          Applications           /|\
               |         ______________           |
               |         | User Mode  |           |  
               |         ______________           | 
               |               |                  |  
Implementation |        _______________           |   Abstraction
    Detail     |        | Kernel Mode |           |
               |        _______________           |
               |               |                  |
               |               |                  | 
               |               |                  |
              \|/          Hardware               |

Kernel Mode "prevents" User Mode applications from damaging the system
or its features.

Modern microprocessors implement at least 2 different privilege states
in hardware. Intel x86 processors, for example, have 4 states that
determine the PL (Privilege Level), numbered 0, 1, 2 and 3, with 0 used
for Kernel Mode.

A Unix OS requires only 2 privilege levels, and we will use this
paradigm as our point of reference.


    Switching from User Mode to Kernel Mode

There are 2 typical points of switching:

   1. System Call: the task voluntarily calls code living in Kernel Mode.
   2. IRQ (or exception): when an IRQ (or exception) comes, an IRQ
      handler (or exception handler) is called.


      System Calls

System calls are special functions that give access to OS routines
which live in Kernel Mode.

A system call is needed when we:

    * access an I/O device or a file (like read or write)
    * need to access privileged information (like pid, changing
      scheduling policy or other information)
    * need to change execution context (like forking or executing some
      other application)
    * need to execute a particular command (like ''chdir'', ''kill'',
      ''brk'', or ''signal'')

                                 |                |
                         ------->| System Call i  | (Accessing Devices)
|                |       |       |  [sys_read()]  |
| ...            |       |       |                |
| system_call(i) |--------       |                |
|   [read()]     |               |                |
| ...            |               |                |
| system_call(j) |--------       |                |  
|   [getpid()]   |       |       |                |
| ...            |       ------->| System Call j  | (Accessing kernel data structures)
|                |               |  [sys_getpid()]|
                                 |                | 
 
    USER MODE                        KERNEL MODE
 
  
                        Unix System Calls Working 

System calls are almost the only interface that User Mode can use to
talk to low level resources (hardware). The only exception is when a
process uses the ''ioperm'' system call: in that case the User Mode
process can access a device's I/O ports directly (but it cannot use
IRQs).

System Calls under Linux Kernel 2.4.17, from [ arch/i386/kernel/entry.S ]

        .long SYMBOL_NAME(sys_ni_syscall)       /* 0  -  old "setup()" system call*/
        .long SYMBOL_NAME(sys_exit)
        .long SYMBOL_NAME(sys_fork)
        .long SYMBOL_NAME(sys_read)
        .long SYMBOL_NAME(sys_write)
        .long SYMBOL_NAME(sys_open)             /* 5 */
        .long SYMBOL_NAME(sys_close)
        .long SYMBOL_NAME(sys_waitpid)
        .long SYMBOL_NAME(sys_creat)
        .long SYMBOL_NAME(sys_link)
        .long SYMBOL_NAME(sys_unlink)           /* 10 */
        .long SYMBOL_NAME(sys_execve)
        .long SYMBOL_NAME(sys_chdir)
        .long SYMBOL_NAME(sys_time)
        .long SYMBOL_NAME(sys_mknod)
        .long SYMBOL_NAME(sys_chmod)            /* 15 */
        .long SYMBOL_NAME(sys_lchown16)
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old break syscall holder */
        .long SYMBOL_NAME(sys_stat)
        .long SYMBOL_NAME(sys_lseek)
        .long SYMBOL_NAME(sys_getpid)           /* 20 */
        .long SYMBOL_NAME(sys_mount)
        .long SYMBOL_NAME(sys_oldumount)
        .long SYMBOL_NAME(sys_setuid16)
        .long SYMBOL_NAME(sys_getuid16)
        .long SYMBOL_NAME(sys_stime)            /* 25 */
        .long SYMBOL_NAME(sys_ptrace)
        .long SYMBOL_NAME(sys_alarm)
        .long SYMBOL_NAME(sys_fstat)
        .long SYMBOL_NAME(sys_pause)
        .long SYMBOL_NAME(sys_utime)            /* 30 */
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old stty syscall holder */
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old gtty syscall holder */
        .long SYMBOL_NAME(sys_access)
        .long SYMBOL_NAME(sys_nice)
        .long SYMBOL_NAME(sys_ni_syscall)       /* 35 */                /* old ftime syscall holder */
        .long SYMBOL_NAME(sys_sync)
        .long SYMBOL_NAME(sys_kill)
        .long SYMBOL_NAME(sys_rename)
        .long SYMBOL_NAME(sys_mkdir)
        .long SYMBOL_NAME(sys_rmdir)            /* 40 */
        .long SYMBOL_NAME(sys_dup)
        .long SYMBOL_NAME(sys_pipe)
        .long SYMBOL_NAME(sys_times)
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old prof syscall holder */
        .long SYMBOL_NAME(sys_brk)              /* 45 */
        .long SYMBOL_NAME(sys_setgid16)
        .long SYMBOL_NAME(sys_getgid16)
        .long SYMBOL_NAME(sys_signal)
        .long SYMBOL_NAME(sys_geteuid16)
        .long SYMBOL_NAME(sys_getegid16)        /* 50 */
        .long SYMBOL_NAME(sys_acct)
        .long SYMBOL_NAME(sys_umount)                                   /* recycled never used phys() */
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old lock syscall holder */
        .long SYMBOL_NAME(sys_ioctl)
        .long SYMBOL_NAME(sys_fcntl)            /* 55 */
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old mpx syscall holder */
        .long SYMBOL_NAME(sys_setpgid)
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old ulimit syscall holder */
        .long SYMBOL_NAME(sys_olduname)
        .long SYMBOL_NAME(sys_umask)            /* 60 */
        .long SYMBOL_NAME(sys_chroot)
        .long SYMBOL_NAME(sys_ustat)
        .long SYMBOL_NAME(sys_dup2)
        .long SYMBOL_NAME(sys_getppid)
        .long SYMBOL_NAME(sys_getpgrp)          /* 65 */
        .long SYMBOL_NAME(sys_setsid)
        .long SYMBOL_NAME(sys_sigaction)
        .long SYMBOL_NAME(sys_sgetmask)
        .long SYMBOL_NAME(sys_ssetmask)
        .long SYMBOL_NAME(sys_setreuid16)       /* 70 */
        .long SYMBOL_NAME(sys_setregid16)
        .long SYMBOL_NAME(sys_sigsuspend)
        .long SYMBOL_NAME(sys_sigpending)
        .long SYMBOL_NAME(sys_sethostname)
        .long SYMBOL_NAME(sys_setrlimit)        /* 75 */
        .long SYMBOL_NAME(sys_old_getrlimit)
        .long SYMBOL_NAME(sys_getrusage)
        .long SYMBOL_NAME(sys_gettimeofday)
        .long SYMBOL_NAME(sys_settimeofday)
        .long SYMBOL_NAME(sys_getgroups16)      /* 80 */
        .long SYMBOL_NAME(sys_setgroups16)
        .long SYMBOL_NAME(old_select)
        .long SYMBOL_NAME(sys_symlink)
        .long SYMBOL_NAME(sys_lstat)
        .long SYMBOL_NAME(sys_readlink)         /* 85 */
        .long SYMBOL_NAME(sys_uselib)
        .long SYMBOL_NAME(sys_swapon)
        .long SYMBOL_NAME(sys_reboot)
        .long SYMBOL_NAME(old_readdir)
        .long SYMBOL_NAME(old_mmap)             /* 90 */
        .long SYMBOL_NAME(sys_munmap)
        .long SYMBOL_NAME(sys_truncate)
        .long SYMBOL_NAME(sys_ftruncate)
        .long SYMBOL_NAME(sys_fchmod)
        .long SYMBOL_NAME(sys_fchown16)         /* 95 */
        .long SYMBOL_NAME(sys_getpriority)
        .long SYMBOL_NAME(sys_setpriority)
        .long SYMBOL_NAME(sys_ni_syscall)                               /* old profil syscall holder */
        .long SYMBOL_NAME(sys_statfs)
        .long SYMBOL_NAME(sys_fstatfs)          /* 100 */
        .long SYMBOL_NAME(sys_ioperm)
        .long SYMBOL_NAME(sys_socketcall)
        .long SYMBOL_NAME(sys_syslog)
        .long SYMBOL_NAME(sys_setitimer)
        .long SYMBOL_NAME(sys_getitimer)        /* 105 */
        .long SYMBOL_NAME(sys_newstat)
        .long SYMBOL_NAME(sys_newlstat)
        .long SYMBOL_NAME(sys_newfstat)
        .long SYMBOL_NAME(sys_uname)
        .long SYMBOL_NAME(sys_iopl)             /* 110 */
        .long SYMBOL_NAME(sys_vhangup)
        .long SYMBOL_NAME(sys_ni_syscall)       /* old "idle" system call */
        .long SYMBOL_NAME(sys_vm86old)
        .long SYMBOL_NAME(sys_wait4)
        .long SYMBOL_NAME(sys_swapoff)          /* 115 */
        .long SYMBOL_NAME(sys_sysinfo)
        .long SYMBOL_NAME(sys_ipc)
        .long SYMBOL_NAME(sys_fsync)
        .long SYMBOL_NAME(sys_sigreturn)
        .long SYMBOL_NAME(sys_clone)            /* 120 */
        .long SYMBOL_NAME(sys_setdomainname)
        .long SYMBOL_NAME(sys_newuname)
        .long SYMBOL_NAME(sys_modify_ldt)
        .long SYMBOL_NAME(sys_adjtimex)
        .long SYMBOL_NAME(sys_mprotect)         /* 125 */
        .long SYMBOL_NAME(sys_sigprocmask)
        .long SYMBOL_NAME(sys_create_module)
        .long SYMBOL_NAME(sys_init_module)
        .long SYMBOL_NAME(sys_delete_module)
        .long SYMBOL_NAME(sys_get_kernel_syms)  /* 130 */
        .long SYMBOL_NAME(sys_quotactl)
        .long SYMBOL_NAME(sys_getpgid)
        .long SYMBOL_NAME(sys_fchdir)
        .long SYMBOL_NAME(sys_bdflush)
        .long SYMBOL_NAME(sys_sysfs)            /* 135 */
        .long SYMBOL_NAME(sys_personality)
        .long SYMBOL_NAME(sys_ni_syscall)       /* for afs_syscall */
        .long SYMBOL_NAME(sys_setfsuid16)
        .long SYMBOL_NAME(sys_setfsgid16)
        .long SYMBOL_NAME(sys_llseek)           /* 140 */
        .long SYMBOL_NAME(sys_getdents)
        .long SYMBOL_NAME(sys_select)
        .long SYMBOL_NAME(sys_flock)
        .long SYMBOL_NAME(sys_msync)
        .long SYMBOL_NAME(sys_readv)            /* 145 */
        .long SYMBOL_NAME(sys_writev)
        .long SYMBOL_NAME(sys_getsid)
        .long SYMBOL_NAME(sys_fdatasync)
        .long SYMBOL_NAME(sys_sysctl)
        .long SYMBOL_NAME(sys_mlock)            /* 150 */
        .long SYMBOL_NAME(sys_munlock)
        .long SYMBOL_NAME(sys_mlockall)
        .long SYMBOL_NAME(sys_munlockall)
        .long SYMBOL_NAME(sys_sched_setparam)
        .long SYMBOL_NAME(sys_sched_getparam)   /* 155 */
        .long SYMBOL_NAME(sys_sched_setscheduler)
        .long SYMBOL_NAME(sys_sched_getscheduler)
        .long SYMBOL_NAME(sys_sched_yield)
        .long SYMBOL_NAME(sys_sched_get_priority_max)
        .long SYMBOL_NAME(sys_sched_get_priority_min)  /* 160 */
        .long SYMBOL_NAME(sys_sched_rr_get_interval)
        .long SYMBOL_NAME(sys_nanosleep)
        .long SYMBOL_NAME(sys_mremap)
        .long SYMBOL_NAME(sys_setresuid16)
        .long SYMBOL_NAME(sys_getresuid16)      /* 165 */
        .long SYMBOL_NAME(sys_vm86)
        .long SYMBOL_NAME(sys_query_module)
        .long SYMBOL_NAME(sys_poll)
        .long SYMBOL_NAME(sys_nfsservctl)
        .long SYMBOL_NAME(sys_setresgid16)      /* 170 */
        .long SYMBOL_NAME(sys_getresgid16)
        .long SYMBOL_NAME(sys_prctl)
        .long SYMBOL_NAME(sys_rt_sigreturn)
        .long SYMBOL_NAME(sys_rt_sigaction)
        .long SYMBOL_NAME(sys_rt_sigprocmask)   /* 175 */
        .long SYMBOL_NAME(sys_rt_sigpending)
        .long SYMBOL_NAME(sys_rt_sigtimedwait)
        .long SYMBOL_NAME(sys_rt_sigqueueinfo)
        .long SYMBOL_NAME(sys_rt_sigsuspend)
        .long SYMBOL_NAME(sys_pread)            /* 180 */
        .long SYMBOL_NAME(sys_pwrite)
        .long SYMBOL_NAME(sys_chown16)
        .long SYMBOL_NAME(sys_getcwd)
        .long SYMBOL_NAME(sys_capget)
        .long SYMBOL_NAME(sys_capset)           /* 185 */
        .long SYMBOL_NAME(sys_sigaltstack)
        .long SYMBOL_NAME(sys_sendfile)
        .long SYMBOL_NAME(sys_ni_syscall)               /* streams1 */
        .long SYMBOL_NAME(sys_ni_syscall)               /* streams2 */
        .long SYMBOL_NAME(sys_vfork)            /* 190 */
        .long SYMBOL_NAME(sys_getrlimit)
        .long SYMBOL_NAME(sys_mmap2)
        .long SYMBOL_NAME(sys_truncate64)
        .long SYMBOL_NAME(sys_ftruncate64)
        .long SYMBOL_NAME(sys_stat64)           /* 195 */
        .long SYMBOL_NAME(sys_lstat64)
        .long SYMBOL_NAME(sys_fstat64)
        .long SYMBOL_NAME(sys_lchown)
        .long SYMBOL_NAME(sys_getuid)
        .long SYMBOL_NAME(sys_getgid)           /* 200 */
        .long SYMBOL_NAME(sys_geteuid)
        .long SYMBOL_NAME(sys_getegid)
        .long SYMBOL_NAME(sys_setreuid)
        .long SYMBOL_NAME(sys_setregid)
        .long SYMBOL_NAME(sys_getgroups)        /* 205 */
        .long SYMBOL_NAME(sys_setgroups)
        .long SYMBOL_NAME(sys_fchown)
        .long SYMBOL_NAME(sys_setresuid)
        .long SYMBOL_NAME(sys_getresuid)
        .long SYMBOL_NAME(sys_setresgid)        /* 210 */
        .long SYMBOL_NAME(sys_getresgid)
        .long SYMBOL_NAME(sys_chown)
        .long SYMBOL_NAME(sys_setuid)
        .long SYMBOL_NAME(sys_setgid)
        .long SYMBOL_NAME(sys_setfsuid)         /* 215 */
        .long SYMBOL_NAME(sys_setfsgid)
        .long SYMBOL_NAME(sys_pivot_root)
        .long SYMBOL_NAME(sys_mincore)
        .long SYMBOL_NAME(sys_madvise)
        .long SYMBOL_NAME(sys_getdents64)       /* 220 */
        .long SYMBOL_NAME(sys_fcntl64)
        .long SYMBOL_NAME(sys_ni_syscall)       /* reserved for TUX */
        .long SYMBOL_NAME(sys_ni_syscall)       /* Reserved for Security */
        .long SYMBOL_NAME(sys_gettid)
        .long SYMBOL_NAME(sys_readahead)        /* 225 */


      IRQ Event

When an IRQ comes, the running task is interrupted so that the IRQ
Handler can service it.

After the IRQ is handled, control returns to exactly the point of
interruption, as if nothing had happened.

           
              Running Task 
             |-----------|          (3)
NORMAL       |   |       | [break execution] IRQ Handler
EXECUTION (1)|   |       |     ------------->|---------| 
             |  \|/      |     |             |  does   |         
 IRQ (2)---->| ..        |----->             |  some   |      
             |   |       |<-----             |  work   |       
BACK TO      |   |       |     |             |  ..(4). |
NORMAL    (6)|  \|/      |     <-------------|_________|
EXECUTION    |___________|  [return to code]
                                    (5)
               USER MODE                     KERNEL MODE

         User->Kernel Mode Transition caused by IRQ event
     

The numbered steps below refer to the sequence of events in the diagram
above:

   1. Process is executing
   2. IRQ comes while the task is running.
   3. Task is interrupted to call an "Interrupt handler".
   4. The "Interrupt handler" code is executed.
   5. Control returns back to task user mode (as if nothing happened)
   6. Process returns back to normal execution

Of special interest is the Timer IRQ, which comes every TIMER ms to
manage:

   1. Alarms
   2. System and task counters (used by the scheduler to decide when to
      stop a process, or for accounting)
   3. Multitasking, based on a wake up mechanism after TIMESLICE time.


    Multitasking


      Mechanism

The key point of modern OSs is the "Task". A Task is an application
running in memory that shares all resources (including CPU and Memory)
with other Tasks.

This "resource sharing" is managed by the "Multitasking Mechanism". The
Multitasking Mechanism switches from one task to another after a
"timeslice" time. Users have the "illusion" that they own all resources.
We can also imagine a single user scenario, where a user can have the
"illusion" of running many tasks at the same time.

To implement multitasking, each task has a "state" variable, which can
be:

   1. READY, ready for execution
   2. BLOCKED, waiting for a resource

The task state is tracked by the list the task sits in: the READY list
or the BLOCKED list.


      Task Switching

The movement from one task to another is called ''Task Switching''. Many
computers have a hardware instruction which automatically performs this
operation. Task Switching occurs in the following cases:

   1. After the Timeslice ends: we need to schedule a "Ready for
      execution" task and give it access.
   2. When a Task has to wait for a device: we need to schedule a new
      task and switch to it *

* We schedule another task to prevent "Busy Waiting", which occurs
when we wait for a device instead of performing other work.

Task Switching is managed by the "Schedule" entity.

 
Timer    |           |
 IRQ     |           |                            Schedule
  |      |           |                     ________________________
  |----->|   Task 1  |<------------------>|(1)Chooses a Ready Task |
  |      |           |                    |(2)Task Switching       |
  |      |___________|                    |________________________|   
  |      |           |                               /|\
  |      |           |                                | 
  |      |           |                                |
  |      |           |                                |
  |      |           |                                |      
  |----->|   Task 2  |<-------------------------------|
  |      |           |                                |
  |      |___________|                                |
  .      .     .     .                                .
  .      .     .     .                                .
  .      .     .     .                                .
  |      |           |                                |
  |      |           |                                |
  ------>|   Task N  |<--------------------------------
         |           |
         |___________| 
    
            Task Switching based on TimeSlice
 

A typical Timeslice for Linux is about 10 ms.

 
 |           |            
 |           | Resource    _____________________________
 |   Task 1  |----------->|(1) Enqueue Resource request |
 |           |  Access    |(2)  Mark Task as blocked    |
 |           |            |(3)  Choose a Ready Task     |
 |___________|            |(4)    Task Switching        |
                          |_____________________________|
                                       |
                                       |
 |           |                         |
 |           |                         |
 |   Task 2  |<-------------------------
 |           |  
 |           |
 |___________|
 
     Task Switching based on Waiting for a Resource
 

    Microkernel vs Monolithic OS

A Microkernel OS uses Tasks not only for user mode processes, but also
for the kernel managers themselves, like Floppy-Task, HDD-Task,
Net-Task and so on. Some examples are Amoeba and Mach.


      PROs and CONs of Microkernel OS

PROS:

    * The OS is simpler to maintain because each Task manages a single
      kind of operation. So if you want to modify networking, you modify
      Net-Task (ideally, if no structural update is needed).

CONS:

    * Performance is worse than in a Monolithic OS, because each request
      costs an extra 2*TASK_SWITCH time (the first switch to enter the
      specific Task, the second to leave it).


    Networking


In the RX stage the kernel:

   1. Manages the handshake with low level devices (like an ethernet
      card or modem), receiving "frames" from them.
   2. Builds TCP/IP "packets" from "frames" (like Ethernet or PPP ones),
   3. Converts ''packets'' into ''sockets'', passing them to the right
      application (using the port number), or
   4. Forwards packets to the right queue

frames         packets              sockets
NIC ---------> Kernel ----------> Application
                  |    packets
                  --------------> Forward
                        - RX - 

In the TX stage the kernel:

   1. Converts sockets or
   2. Queues data into TCP/IP ''packets''
   3. Splits ''packets'' into ''frames'' (like Ethernet or PPP ones)
   4. Sends ''frames'' using HW drivers

sockets       packets                     frames
Application ---------> Kernel ----------> NIC
              packets     /|\    
Forward  -------------------
                        - TX -  


    Virtual Memory


      Segmentation


            |       Stack        |
            |          |         |
            |         \|/        |
            |        Free        | 
            |         /|\        |     Segment <---> Process    
            |          |         |
            |        Heap        |
            | Data uninitialized |
            |  Data initialized  |
            |       Code         |
            |____________________|  
 
                   Segment  

We can say that a segment is the logical entity of an application, or
the image of the application in memory.

Linux uses only 4 segments for both the Kernel and all Processes.


      Problems of Segmentation

 
                                 ____________________
                          ----->|                    |----->
                          | IN  |     Segment A      | OUT
 ____________________     |     |____________________|   
|                    |____|     |                    |   
|     Segment B      |          |     Segment B      |
|                    |____      |                    |   
|____________________|    |     |____________________|   
                          |     |     Segment C      |   
                          |     |____________________|
                          ----->|     Segment D      |-----> 
                            IN  |____________________| OUT 
 
                     Segmentation problem


In the diagram above, we want processes A and D to exit and process B
to enter. As we can see, there is enough total space for B, but we
cannot split B into 2 pieces, so we CANNOT load it (out of memory).

This problem occurs because pure segments are contiguous areas (because
they are logical areas) and cannot be split.


      Pagination

 
             ____________________
            |     Page 1         |
            |____________________|
            |     Page 2         |
            |____________________| 
            |      ..            |     Segment <---> Process    
            |____________________|
            |     Page n         |
            |____________________|
            |                    |
            |____________________|
            |                    |
            |____________________|  
 
                   Segment  
 

Pagination splits memory into "n" pieces, each one with a fixed length.

A process may be loaded into one or more Pages. When memory is freed, all
pages are freed (see Segmentation Problem, before).

Pagination is also used for another important purpose, "Swapping". If a
page is not present in physical memory, the access generates an
EXCEPTION that makes the Kernel look for the page in storage memory.
This mechanism allows the OS to load more applications than physical
memory alone would allow.


      Pagination Problem

             ____________________
   Page   X |     Process Y      |
            |____________________|
            |                    |
            |       WASTE        |
            |       SPACE        |
            |____________________|  
   
              Pagination Problem
 

In the diagram above, we can see what is wrong with the pagination
policy: when Process Y is loaded into Page X, ALL memory space of the
Page is allocated, so the remaining space at the end of the Page is
wasted.


      Segmentation and Pagination

How can we solve the segmentation and pagination problems? By using
both policies together.

 
                                  |      ..            |
                                  |____________________|
                            ----->|      Page 1        |
                            |     |____________________|
                            |     |      ..            |
 ____________________       |     |____________________|
|                    |      |---->|      Page 2        |
|      Segment X     |  ----|     |____________________|
|                    |      |     |       ..           |
|____________________|      |     |____________________|
                            |     |       ..           |
                            |     |____________________|
                            |---->|      Page 3        |
                                  |____________________|
                                  |       ..           |
 

Process X, identified by Segment X, is split into 3 pieces, each of
which is loaded into a page.

We avoid:

   1. Segmentation problem: we allocate per Page, so we also free per
      Page and we manage free space in an optimized way.
   2. Pagination problem: only the last page wastes space, and we can
      decide to use very small pages, for example 4096 bytes long
      (losing at most 4096*N_Tasks bytes), and manage them with
      hierarchical paging (using 2 or 3 levels of paging)

 
                          |         |           |         |
                          |         |   Offset2 |  Value  |
                          |         |        /|\|         |
                  Offset1 |         |-----    | |         |
                      /|\ |         |    |    | |         |
                       |  |         |    |   \|/|         | 
                       |  |         |    ------>|         |
                      \|/ |         |           |         |
 Base Paging Address ---->|         |           |         |
                          | ....... |           | ....... |
                          |         |           |         |    
 
                     Hierarchical Paging


    Linux Startup

The first C code of the Linux kernel, ''start_kernel'', is called from
the ''startup_32:'' asm label:

|startup_32:
   |start_kernel
      |lock_kernel
      |trap_init
      |init_IRQ
      |sched_init
      |softirq_init
      |time_init
      |console_init 
      |#ifdef CONFIG_MODULES 
         |init_modules 
      |#endif 
      |kmem_cache_init 
      |sti 
      |calibrate_delay 
      |mem_init
      |kmem_cache_sizes_init
      |pgtable_cache_init
      |fork_init
      |proc_caches_init 
      |vfs_caches_init
      |buffer_init
      |page_cache_init
      |signals_init 
      |#ifdef CONFIG_PROC_FS 
        |proc_root_init 
      |#endif 
      |#if defined(CONFIG_SYSVIPC) 
         |ipc_init
      |#endif 
      |check_bugs      
      |smp_init
      |rest_init
         |kernel_thread
         |unlock_kernel
         |cpu_idle


The last function ''rest_init'' does the following:

   1. launches the kernel thread ''init''
   2. calls unlock_kernel
   3. makes the kernel run the cpu_idle routine, which is the idle loop
      executed when nothing is scheduled

In fact the start_kernel procedure never ends. It will execute the
cpu_idle routine endlessly.

The description of ''init'', the first Kernel Thread, follows:

|init
   |lock_kernel
   |do_basic_setup
      |mtrr_init
      |sysctl_init
      |pci_init
      |sock_init
      |start_context_thread
      |do_init_calls
         |(*call())-> kswapd_init
   |prepare_namespace
   |free_initmem
   |unlock_kernel


    Linux Multitasking

      Task States

A Linux Task can be in one of the following states (according to
[include/linux/sched.h]):

   1. TASK_RUNNING, it means that it is in the "Ready List"
   2. TASK_INTERRUPTIBLE, task waiting for a signal or a resource (sleeping)
   3. TASK_UNINTERRUPTIBLE, task waiting for a resource (sleeping), it
      is in some "Wait Queue"
   4. TASK_ZOMBIE, task that has terminated but whose parent has not
      yet collected its exit status
   5. TASK_STOPPED, task whose execution has been stopped (for example
      while being debugged)


      Graphical Interaction

       ______________     CPU Available     ______________
      |              |  ---------------->  |              |
      | TASK_RUNNING |                     | Real Running |  
      |______________|  <----------------  |______________|
                           CPU Busy
            |   /|\       
Waiting for |    | Resource  
 Resource   |    | Available             
           \|/   |      
    ______________________                     
   |                      |
   | TASK_INTERRUPTIBLE / |
   | TASK_UNINTERRUPTIBLE |
   |______________________|
 
                     Main Multitasking Flow


    Timeslice


      PIT 8253 Programming

Every 10 ms (depending on the HZ value) an IRQ0 comes, which helps us in
a multitasking environment. This signal comes from the PIC 8259 (in arch
386+), which is connected to the PIT 8253 with a clock of 1.19318 MHz.

    _____         ______        ______        
   | CPU |<------| 8259 |------| 8253 |
   |_____| IRQ0  |______|      |___/|\|
                                    |_____ CLK 1.19318 MHz
          
// From include/asm/param.h
#ifndef HZ 
#define HZ 100 
#endif
 
// From include/asm/timex.h
#define CLOCK_TICK_RATE 1193180 /* Underlying HZ */
 
// From include/linux/timex.h
#define LATCH ((CLOCK_TICK_RATE + HZ/2) / HZ) /* For divider */
 
// From arch/i386/kernel/i8259.c
outb_p(0x34,0x43); /* binary, mode 2, LSB/MSB, ch 0 */ 
outb_p(LATCH & 0xff , 0x40); /* LSB */
outb(LATCH >> 8 , 0x40); /* MSB */
 

So we program the 8253 (PIT, Programmable Interval Timer) with LATCH =
(1193180 + HZ/2) / HZ = 11932 when HZ=100 (default); the integer
division rounds 11931.8 to the nearest integer. LATCH is the frequency
divisor.

LATCH = 11932 gives the 8253 an output frequency of 1193180 / 11932 =
99.998 Hz, so the period is about 10 ms.

So Timeslice = 1/HZ.

With each Timeslice we temporarily interrupt current process execution
(without task switching), and we do some housekeeping work, after which
we'll return back to our previous process.


     Scheduler

The scheduler is the piece of code that chooses which Task has to be
executed at a given time.

Any time the running task has to change, the scheduler selects a
candidate. Below is the ''schedule [kernel/sched.c]'' function.


|schedule
   |do_softirq // manages post-IRQ work
   |for each task
      |calculate counter
   |prepare_to_switch // does nothing on i386
   |switch_mm // change Memory context (change CR3 value)
   |switch_to (assembler)
      |SAVE ESP
      |RESTORE future_ESP
      |SAVE EIP
      |push future_EIP *** push parameter as we did a call 
         |jmp __switch_to (it does some TSS work) 
         |__switch_to()
          ..
         |ret *** ret from call using future_EIP in place of call address
      new_task


    Task Switching


      When does Task switching occur?

Now we'll see how the Linux Kernel switches from one task to another.

Task Switching is needed in many cases, such as the following:

    * when the TimeSlice ends, we need to give access to some other task
    * when a task decides to access a resource, it sleeps waiting for
      it, so we have to choose another task
    * when a task waits for a pipe, we have to give access to another
      task, which would write to the pipe


      Task Switching

                           TASK SWITCHING TRICK
#define switch_to(prev,next,last) do {                                  \
        asm volatile("pushl %%esi\n\t"                                  \
                     "pushl %%edi\n\t"                                  \
                     "pushl %%ebp\n\t"                                  \
                     "movl %%esp,%0\n\t"        /* save ESP */          \
                     "movl %3,%%esp\n\t"        /* restore ESP */       \
                     "movl $1f,%1\n\t"          /* save EIP */          \
                     "pushl %4\n\t"             /* restore EIP */       \
                     "jmp __switch_to\n"                                \
                     "1:\t"                                             \
                     "popl %%ebp\n\t"                                   \
                     "popl %%edi\n\t"                                   \
                     "popl %%esi\n\t"                                   \
                     :"=m" (prev->thread.esp),"=m" (prev->thread.eip),  \
                      "=b" (last)                                       \
                     :"m" (next->thread.esp),"m" (next->thread.eip),    \
                      "a" (prev), "d" (next),                           \
                      "b" (prev));                                      \
} while (0)

The trick is here:

   1. ''pushl %4'' puts future_EIP onto the stack
   2. ''jmp __switch_to'' executes the ''__switch_to'' function but,
      unlike ''call'', it pushes no return address, so the final ''ret''
      returns to the value pushed in point 1 (so to the new Task!)

      U S E R   M O D E                 K E R N E L     M O D E

 |          |     |          |       |          |     |          |
 |          |     |          | Timer |          |     |          |
 |          |     |  Normal  |  IRQ  |          |     |          |
 |          |     |   Exec   |------>|Timer_Int.|     |          |
 |          |     |     |    |       | ..       |     |          |
 |          |     |    \|/   |       |schedule()|     | Task1 Ret|
 |          |     |          |       |_switch_to|<--  |  Address |
 |__________|     |__________|       |          |  |  |          |
                                     |          |  |S |          | 
Task1 Data/Stack   Task1 Code        |          |  |w |          |
                                     |          | T|i |          |
                                     |          | a|t |          |
 |          |     |          |       |          | s|c |          |
 |          |     |          | Timer |          | k|h |          |
 |          |     |  Normal  |  IRQ  |          |  |i |          | 
 |          |     |   Exec   |------>|Timer_Int.|  |n |          |
 |          |     |     |    |       | ..       |  |g |          |
 |          |     |    \|/   |       |schedule()|  |  | Task2 Ret|
 |          |     |          |       |_switch_to|<--  |  Address |
 |__________|     |__________|       |__________|     |__________|
 
Task2 Data/Stack   Task2 Code        Kernel Code  Kernel Data/Stack
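
The idea of saving one stack pointer and instruction pointer and
restoring another can be tried out in User Mode. The following is only
an analogy, assuming a glibc-style ''<ucontext.h>'' is available: it
uses ''swapcontext()'' instead of the kernel's ''switch_to'' macro, and
the names ''task_body'', ''record'' and ''run_switch_demo'' are made up
for this sketch.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;   /* saved register sets (SP, IP, ...) */
static char task_stack[64 * 1024];      /* private stack for the second context */
static int trace[4];                    /* order in which the steps ran */
static int trace_len;

static void record(int step)
{
    assert(trace_len < 4);
    trace[trace_len++] = step;
}

/* Runs on task_stack, not on the caller's stack. */
static void task_body(void)
{
    record(2);
    swapcontext(&task_ctx, &main_ctx);  /* save our SP/IP, resume the caller */
    record(4);
}                                       /* returning resumes uc_link (main_ctx) */

/* Switch back and forth twice; returns the number of recorded steps. */
int run_switch_demo(void)
{
    trace_len = 0;
    if (getcontext(&task_ctx) != 0)
        return -1;
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;
    makecontext(&task_ctx, task_body, 0);

    record(1);
    swapcontext(&main_ctx, &task_ctx);  /* first switch: task_body starts */
    record(3);
    swapcontext(&main_ctx, &task_ctx);  /* second switch: task_body resumes */
    return trace_len;
}
```

After ''run_switch_demo()'' the trace holds the steps 1, 2, 3, 4: each
context continues exactly where it was suspended, which is what the
saved ESP/EIP pair achieves for kernel tasks.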


    Fork


Fork is used to create another task. We start from a Task Parent, and we
copy many data structures to Task Child.

 
                               |         |
                               | ..      |
         Task Parent           |         |
         |         |           |         |
         |  fork   |---------->|  CREATE |   
         |         |          /|   NEW   |
         |_________|         / |   TASK  |
                            /  |         |
             ---           /   |         |
             ---          /    | ..      |
                         /     |         |
         Task Child     / 
         |         |   /
         |  fork   |<-/
         |         |
         |_________|
              
                       Fork SysCall


      What is not copied

The newly created Task (''Task Child'') is almost identical to its
parent (''Task Parent''); there are only a few differences:

   1. the PID, obviously
   2. the child's ''fork()'' returns 0, while the parent's ''fork()''
      returns the PID of Task Child, so that the two can be told apart
      in User Mode
   3. data pages are not copied at fork time: they are shared and
      marked ''READ + EXECUTE'', without ''WRITE'', for both tasks, so
      when a write request comes, a ''Page Fault'' exception is
      generated which creates a new independent page. This mechanism is
      called ''Copy on Write'' (see Chapter 10 for more).
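
The first two differences can be observed from User Mode with the POSIX
''fork()'', ''getpid()'', ''getppid()'' and ''waitpid()'' calls. This is
a minimal sketch; the function name ''fork_return_values_ok'' is
invented for the example.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if fork() behaved as described, 0 if not, -1 on error. */
int fork_return_values_ok(void)
{
    pid_t parent_pid = getpid();
    pid_t pid = fork();

    if (pid < 0)
        return -1;                       /* fork failed, no child exists */

    if (pid == 0) {
        /* Child: fork() returned 0, and we have a new PID of our own. */
        int ok = (getpid() != parent_pid) && (getppid() == parent_pid);
        _exit(ok ? 0 : 1);
    }

    /* Parent: fork() returned the PID of the child. */
    assert(pid > 0);
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```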


    Linux Memory Management


Linux uses segmentation + pagination, which simplifies notation.


      Segments

Linux uses only 4 segments:

    * 2 segments (code and data/stack) for KERNEL SPACE from [0xC000
      0000] (3 GB) to [0xFFFF FFFF] (4 GB)
    * 2 segments (code and data/stack) for USER SPACE from [0] (0 GB) to
      [0xBFFF FFFF] (3 GB)

                               __
   4 GB--->|                |    |
           |     Kernel     |    |  Kernel Space (Code + Data/Stack)
           |                |  __|
   3 GB--->|----------------|  __
           |                |    |
           |                |    |
   2 GB--->|                |    |
           |     Tasks      |    |  User Space (Code + Data/Stack)
           |                |    |
   1 GB--->|                |    |
           |                |    |
           |________________|  __| 
 0x00000000
          Kernel/User Linear addresses
 

    Specific i386 implementation

Again, Linux implements Pagination using 3 Levels of Paging, but in i386
architecture only 2 of them are really used:

 
   ------------------------------------------------------------------
   L    I    N    E    A    R         A    D    D    R    E    S    S
   ------------------------------------------------------------------
        \___/                 \___/                     \_____/ 
 
     PD offset              PF offset                 Frame offset 
     [10 bits]              [10 bits]                 [12 bits]       
          |                     |                          |
          |                     |     -----------          |        
          |                     |     |  Value  |----------|---------
          |     |         |     |     |---------|   /|\    |        |
          |     |         |     |     |         |    |     |        |
          |     |         |     |     |         |    | Frame offset |
          |     |         |     |     |         |   \|/             |
          |     |         |     |     |---------|<------            |
          |     |         |     |     |         |      |            |
          |     |         |     |     |         |      | x 4096     |
          |     |         |  PF offset|_________|-------            |
          |     |         |       /|\ |         |                   |
      PD offset |_________|-----   |  |         |          _________|
            /|\ |         |    |   |  |         |          | 
             |  |         |    |  \|/ |         |         \|/
 _____       |  |         |    ------>|_________|   PHYSICAL ADDRESS 
|     |     \|/ |         |    x 4096 |         |
| CR3 |-------->|         |           |         |
|_____|         | ....... |           | ....... |
                |         |           |         |    
 
               Page Directory          Page File

                       Linux i386 Paging
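
The 10 + 10 + 12 bit split shown in the diagram can be written down as
a few shifts and masks. This is a sketch under the assumptions of the
diagram (4096-byte pages, 1024 entries per table); the macro and
function names are hypothetical and do not come from the kernel
headers.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH  12       /* 12-bit frame offset: 4096-byte pages */
#define PGDIR_SHIFT_SKETCH 22       /* 10 + 12 bits sit below the PD offset */
#define INDEX_MASK         0x3FFu   /* 10 bits: 1024 entries per table */
#define OFFSET_MASK        0xFFFu   /* 12 bits */

struct linear_parts {
    uint32_t pd_index;   /* "PD offset": entry in the Page Directory */
    uint32_t pt_index;   /* "PF offset": entry in the second-level table */
    uint32_t offset;     /* "Frame offset": byte inside the 4 KB frame */
};

/* Split a 32-bit linear address into its three fields. */
struct linear_parts split_linear(uint32_t linear)
{
    struct linear_parts p;
    p.pd_index = (linear >> PGDIR_SHIFT_SKETCH) & INDEX_MASK;
    p.pt_index = (linear >> PAGE_SHIFT_SKETCH) & INDEX_MASK;
    p.offset   = linear & OFFSET_MASK;
    return p;
}

/* The "x 4096" step: frame number from the table entry plus the offset. */
uint32_t physical_address(uint32_t frame_number, uint32_t offset)
{
    assert(offset <= OFFSET_MASK);
    return (frame_number << PAGE_SHIFT_SKETCH) | offset;
}
```

For example, the linear address 0x08048123 splits into directory entry
32, table entry 72 and offset 0x123, while 0xC0000000 falls in
directory entry 768.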
 

    Memory Mapping

Linux manages Access Control with Pagination only, so different Tasks
will have the same segment addresses, but different CR3 values (CR3 is
the register that stores the Page Directory address), pointing to
different Page Directories.

In User Mode a task cannot go beyond the 3 GB limit (0xC0000000), so
only the first 768 page directory entries are meaningful (768*4MB = 3GB).

When a Task enters Kernel Mode (by System Call or by IRQ) the other 256
page directory entries become important, and they point to the same
page tables (''Page File'' in the diagram above) for all Tasks, namely
those of the Kernel.

Note that the Kernel (and only the Kernel) Linear Space maps directly
onto Kernel Physical Space at a fixed offset, so:

 
            ________________ _____                    
           |Other KernelData|___  |  |                |
           |----------------|   | |__|                |
           |     Kernel     |\  |____|   Real Other   |
  3 GB --->|----------------| \      |   Kernel Data  |
           |                |\ \     |                |
           |              __|_\_\____|__   Real       |
           |      Tasks     |  \ \   |     Tasks      |
           |              __|___\_\__|__   Space      |
           |                |    \ \ |                |
           |                |     \ \|----------------|
           |                |      \ |Real KernelSpace|
           |________________|       \|________________|
      
           Logical Addresses          Physical Addresses
 

Linear Kernel Space corresponds to Physical Kernel Space translated 3 GB
down (in fact the kernel page tables look something like { "00000000",
"00000001" }: they perform no real virtualization, they simply map each
linear address to the physical address 3 GB below it).
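
The ''translated 3 GB down'' rule is plain arithmetic. The sketch below
assumes the 3 GB split described above; the constant and function names
are hypothetical, chosen for this example only.

```c
#include <assert.h>
#include <stdint.h>

#define KERNEL_BASE_SKETCH 0xC0000000u   /* 3 GB: start of Kernel Linear Space */

/* Kernel linear address -> physical address (subtract 3 GB). */
uint32_t kernel_linear_to_physical(uint32_t linear)
{
    assert(linear >= KERNEL_BASE_SKETCH);        /* only valid for Kernel Space */
    return linear - KERNEL_BASE_SKETCH;
}

/* Physical address -> kernel linear address (add 3 GB). */
uint32_t physical_to_kernel_linear(uint32_t physical)
{
    assert(physical < 0x40000000u);              /* must fit in the top 1 GB */
    return physical + KERNEL_BASE_SKETCH;
}
```

So the kernel linear address 0xC0100000 corresponds to the physical
address 0x00100000, and the reverse conversion adds the same constant.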

Notice that you'll not have an "addresses conflict" between Kernel and
User spaces because we can manage physical addresses with Page Tables.


    Swap


Swap is managed by the ''kswapd'' daemon (a kernel thread).

Swapping is needed when we have to access a page that is not in physical
memory.

When a Task receives a page fault exception, the following call chain
runs, ending with a wake-up of ''kswapd'':

 
 | Page Fault Exception
 | caused by all these conditions: 
 |   a-) User page 
 |   b-) Read or write access 
 |   c-) Page not present
 |
 |
 -----------> |do_page_fault
                 |handle_mm_fault
                    |pte_alloc 
                       |pte_alloc_one
                          |__get_free_page = __get_free_pages
                             |alloc_pages
                                |alloc_pages_pgdat
                                   |__alloc_pages
                                      |wakeup_kswapd // We wake up kernel thread kswapd
   
                   Page Fault ICA
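
On i386 the three conditions a-), b-) and c-) are reported to the
handler in the page fault error code: bit 0 is clear when the page is
not present, bit 1 is set for a write access and bit 2 is set when the
fault happened in User Mode. A minimal decoding sketch follows; the
macro and function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define PF_PRESENT_SKETCH 0x1u   /* 0: page not present, 1: protection fault */
#define PF_WRITE_SKETCH   0x2u   /* 0: read access, 1: write access */
#define PF_USER_SKETCH    0x4u   /* 0: Kernel Mode, 1: User Mode */

/* True for the case in the diagram: user access (read or write)
 * to a page that is not present. */
int is_user_not_present_fault(uint32_t error_code)
{
    assert((error_code & ~0x1Fu) == 0);          /* only low bits are defined here */
    return (error_code & PF_USER_SKETCH) != 0 &&
           (error_code & PF_PRESENT_SKETCH) == 0;
}

/* True when the faulting access was a write. */
int is_write_fault(uint32_t error_code)
{
    return (error_code & PF_WRITE_SKETCH) != 0;
}
```

An error code of 0x4 (user read, page not present) and 0x6 (user write,
page not present) both match the diagram; 0x7 is a protection fault on
a present page, which is the case used by Copy on Write.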


    Stack and Heap


      Memory allocation

FF..        |                 | <-- bottom of the stack
       /|\  |                 |   | 
 higher |   |                 |   |   stack
 values |   |                 |  \|/  growing
            |                 |
XX..        |                 | <-- top of the stack [Stack Pointer]
            |                 |
            |                 |
            |                 |
00..        |_________________| <-- end of stack [Stack Segment]
                 
                   Stack

The stack is used by functions for:

    * parameters
    * local variables
    * return address

For example, for a classical function:

 |int foo_function (parameter_1, parameter_2, ..., parameter_n) {
    |variable_1 declaration;
    |variable_2 declaration;
      ..
    |variable_n declaration;
   
    |// Body function
    |dynamic variable_1 declaration;
    |dynamic variable_2 declaration;
     ..
    |dynamic variable_n declaration;
   
    |// Code is inside Code Segment, not Data/Stack segment!
    
    |return (ret-type) value; // often it is inside some register, for i386 eax register is used.
 |}

we have:

          |                       |
          | 1. parameter_1 pushed | \
    S     | 2. parameter_2 pushed |  | Before 
    T     | ...................   |  | the calling
    A     | n. parameter_n pushed | /
    C     | ** Return address **  | -- Calling
    K     | 1. local variable_1   | \ 
          | 2. local variable_2   |  | After
          | .................     |  | the calling
          | n. local variable_n   | /
          |                       | 
         ...                     ...   Free
         ...                     ...   stack
          |                       |
    H     | n. dynamic variable_n | \
    E     | ...................   |  | Allocated by
    A     | 2. dynamic variable_2 |  | malloc & kmalloc
    P     | 1. dynamic variable_1 | /
          |_______________________|
        
            Typical stack usage
 
Note: the order of variables may differ depending on the hardware
architecture.
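
The difference between the two areas shows up in the lifetime of the
data. In the sketch below (function names are made up for the example),
every call of ''sum_to'' gets its own copy of ''local'' on the stack and
loses it on return, while the object returned by ''heap_counter'' lives
on the heap until ''free()'' is called.

```c
#include <assert.h>
#include <stdlib.h>

/* Each recursive call pushes a new frame holding its own 'local'. */
int sum_to(int n)
{
    assert(n >= 0);
    int local = n;                  /* stack: gone when this call returns */
    if (n == 0)
        return 0;
    return local + sum_to(n - 1);
}

/* The returned object outlives the call because it lives on the heap. */
int *heap_counter(int start)
{
    int *counter = malloc(sizeof *counter);   /* heap: valid until free() */
    if (counter != NULL)
        *counter = start;
    return counter;                 /* caller must free() it */
}
```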


    Copy_on_write

Copy_on_write is a mechanism used to reduce memory usage. It postpones
memory allocation until the memory is really needed.

For example, when a task executes the "fork()" system call (to create
another task), we still use the same memory pages as the parent, in read
only mode. When a task WRITES into the page, it causes an exception and
the page is copied and marked "rw" (read, write).

 
1-) Page X is shared between Task Parent and Task Child
 Task Parent
 |         | RO Access  ______
 |         |---------->|Page X|    
 |_________|           |______|
                          /|\
                           |
 Task Child                | 
 |         | RO Access     |  
 |         |----------------                
 |_________| 
 
 
2-) Write request
 Task Parent
 |         | RO Access  ______
 |         |---------->|Page X|    Trying to write
 |_________|           |______|
                          /|\
                           |
 Task Child                | 
 |         | RO Access     |  
 |         |----------------                
 |_________| 
 
 
3-) Final Configuration: Task Parent and Task Child each have an independent copy of the page, X and Y
 Task Parent
 |         | RW Access  ______
 |         |---------->|Page X|    
 |_________|           |______|
              
              
 Task Child
 |         | RW Access  ______
 |         |---------->|Page Y|    
 |_________|           |______|
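
Copy_on_write itself is invisible to a program, but its effect can be
checked: a write made by the child after ''fork()'' must not be seen by
the parent. The sketch below uses only POSIX calls; the function name is
invented for the example, and it shows the resulting behaviour, not the
kernel mechanism.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 when the child's write stayed private, 0 if not, -1 on error. */
int write_after_fork_is_private(void)
{
    int *value = malloc(sizeof *value);
    if (value == NULL)
        return -1;
    *value = 1;                      /* page now holds 1 for the parent */

    pid_t pid = fork();
    if (pid < 0) {
        free(value);
        return -1;
    }
    if (pid == 0) {
        *value = 2;                  /* the write gives the child its own copy */
        _exit(*value);               /* report what the child sees */
    }

    int status = 0;
    int result = -1;
    if (waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = (WEXITSTATUS(status) == 2) && (*value == 1);
    free(value);
    return result;
}
```

The child exits with status 2 (its own value), while the parent still
reads 1 from the same linear address.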


    IRQ


An IRQ is an asynchronous signal sent to the microprocessor to notify
it that a requested piece of work has been completed.


                                 |<-->  IRQ(0) [Timer]
                                 |<-->  IRQ(1) [Device 1]
                                 | ..
                                 |<-->  IRQ(n) [Device n]
    _____________________________| 
     /|\      /|\          /|\
      |        |            |
     \|/      \|/          \|/
 
    Task(1)  Task(2) ..   Task(N)
              
             
             IRQ - Tasks Interaction Schema
  

   1. IRQ(i) occurs and Task(j) is interrupted
   2. IRQ(i)_handler is executed
   3. control returns to the interrupted Task(j)

  File Systems
  <http://www.nondot.org/sabre/os/files/FileSystems/ext2fs/ext2fs_toc.html#SEC4>


  Ext2


  Superblock

ext2fs superblock [include/linux/ext2_fs.h]:

struct ext2_super_block {
  unsigned long  s_inodes_count;
  unsigned long  s_blocks_count;
  unsigned long  s_r_blocks_count;
  unsigned long  s_free_blocks_count;
  unsigned long  s_free_inodes_count;
  unsigned long  s_first_data_block;
  unsigned long  s_log_block_size;
  long           s_log_frag_size;
  unsigned long  s_blocks_per_group;
  unsigned long  s_frags_per_group;
  unsigned long  s_inodes_per_group;
  unsigned long  s_mtime;
  unsigned long  s_wtime;
  unsigned short s_mnt_count;
  short          s_max_mnt_count;
  unsigned short s_magic;
  unsigned short s_state;
  unsigned short s_errors;
  unsigned short s_pad;
  unsigned long  s_lastcheck;
  unsigned long  s_checkinterval;
  unsigned long  s_reserved[238];
};

|s_inodes_count|
    the total number of inodes on the fs.

|s_blocks_count|
    the total number of blocks on the fs.

|s_r_blocks_count|
    the total number of blocks reserved for the exclusive use of the
    superuser.

|s_free_blocks_count|
    the total number of free blocks on the fs.

|s_free_inodes_count|
    the total number of free inodes on the fs.

|s_first_data_block|
    the position on the fs of the first data block. Usually, this is
    block number 1 for fs containing 1024 bytes blocks and is number 0
    for other fs.

|s_log_block_size|
    used to compute the logical block size in bytes. The logical block
    size is in fact |1024 << s_log_block_size|.

|s_log_frag_size|
    used to compute the logical fragment size. The logical fragment size
    is in fact |1024 << s_log_frag_size| if |s_log_frag_size| is
    positive and |1024 >> -s_log_frag_size| if |s_log_frag_size| is
    negative.

|s_blocks_per_group|
    the total number of blocks contained in a group.

|s_frags_per_group|
    the total number of fragments contained in a group.

|s_inodes_per_group|
    the total number of inodes contained in a group.

|s_mtime|
    the time at which the last mount of the fs was performed.

|s_wtime|
    the time at which the last write of the superblock on the fs was
    performed.

|s_mnt_count|
    the number of times the fs has been mounted in read-write mode
    without having been checked.

|s_max_mnt_count|
    the maximum number of times the fs may be mounted in read-write mode
    before a check must be done.

|s_magic|
    a magic number that permits the identification of the file system.
    It is |0xEF53| for a normal ext2fs and |0xEF51| for versions of
    ext2fs prior to 0.2b.

|s_state|
    the state of the file system. It contains an or'ed value of
    EXT2_VALID_FS (0x0001) which means: unmounted cleanly; and
    EXT2_ERROR_FS (0x0002) which means: errors detected by the kernel code.

|s_errors|
    indicates what operation to perform when an error occurs. See
    section Error Handling
    <http://www.nondot.org/sabre/os/files/FileSystems/ext2fs/ext2fs_10.html#SEC10>


|s_pad|
    unused.

|s_lastcheck|
    the time of the last check performed on the fs.

|s_checkinterval|
    the maximum possible time between checks on the fs.
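
The computed fields described above (|s_log_block_size|,
|s_log_frag_size|, |s_magic|, |s_state|) can be decoded with a few
helpers. This is a sketch built only from the formulas and constants
given in the list; the helper names are hypothetical, and the upper
bound on the shift is an assumption added to keep the arithmetic within
32 bits.

```c
#include <assert.h>
#include <stdint.h>

#define EXT2_SUPER_MAGIC 0xEF53   /* normal ext2fs, as listed above */
#define EXT2_VALID_FS    0x0001   /* unmounted cleanly */
#define EXT2_ERROR_FS    0x0002   /* errors detected by the kernel code */

/* Logical block size in bytes: 1024 << s_log_block_size. */
uint32_t ext2_block_size(uint32_t s_log_block_size)
{
    assert(s_log_block_size < 22);               /* assumed bound: fits in 32 bits */
    return 1024u << s_log_block_size;
}

/* Logical fragment size: shift left if positive, right if negative. */
uint32_t ext2_frag_size(int32_t s_log_frag_size)
{
    assert(s_log_frag_size > -11 && s_log_frag_size < 22);
    if (s_log_frag_size >= 0)
        return 1024u << s_log_frag_size;
    return 1024u >> -s_log_frag_size;
}

/* True when the magic number identifies a normal ext2fs. */
int ext2_magic_ok(uint16_t s_magic)
{
    return s_magic == EXT2_SUPER_MAGIC;
}

/* True when the fs was unmounted cleanly and no error was recorded. */
int ext2_is_clean(uint16_t s_state)
{
    return (s_state & EXT2_VALID_FS) != 0 && (s_state & EXT2_ERROR_FS) == 0;
}
```

For example, |s_log_block_size| = 2 gives 4096-byte blocks and
|s_log_frag_size| = -1 gives 512-byte fragments.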



    Inodes
    <http://www.nondot.org/sabre/os/files/FileSystems/ext2fs/ext2fs_toc.html#SEC7>

An inode uniquely describes a file.

|i_mode|
    type of file (character, block, link, etc.) and access rights on the
    file.

|i_uid|
    uid of the owner of the file.

|i_size|
    logical size in bytes.

|i_atime|
    last time the file was accessed.

|i_ctime|
    last time the inode information of the file was changed.

|i_mtime|
    last time the file content was modified.

|i_dtime|
    when this file was deleted.

|i_gid|
    gid of the file.

|i_links_count|
    number of links pointing to this file.

|i_blocks|
    number of blocks allocated to this file, counted in 512-byte units.


|i_version|
    version of the file (used by NFS).

|i_file_acl|
    access control list of the file (not used yet).

|i_dir_acl|
    access control list of the directory (not used yet).

|i_faddr|
    block where the fragment of the file resides.

|i_frag|
    number of the fragment in the block.

|i_fsize|
    size of the fragment.


|i_flags|
    flags:

|EXT2_SECRM_FL 0x0001|
    secure deletion

|EXT2_UNRM_FL 0x0002|
    undelete

|EXT2_COMPR_FL 0x0004|
    compress file

|EXT2_SYNC_FL 0x0008|
    synchronous updates.

|EXT2_BAD_INO 1|
    a file containing the list of bad blocks on the file system.


|i_block|
    pointers to blocks
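
A few of the inode fields above need a small computation before they
are useful. The sketch below assumes that |i_mode| uses the same bit
layout as the host's ''st_mode'' (as it does on Linux), so the standard
''<sys/stat.h>'' macros can be applied to it; the helper names are
invented for the example, and the flag values are the ones listed
above.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/stat.h>

#define EXT2_SECRM_FL 0x0001u   /* secure deletion */
#define EXT2_UNRM_FL  0x0002u   /* undelete */
#define EXT2_COMPR_FL 0x0004u   /* compress file */
#define EXT2_SYNC_FL  0x0008u   /* synchronous updates */

/* i_blocks is counted in 512-byte units, whatever the fs block size. */
uint64_t inode_allocated_bytes(uint32_t i_blocks)
{
    return (uint64_t)i_blocks * 512u;
}

/* Test one EXT2_*_FL bit in i_flags. */
int inode_has_flag(uint32_t i_flags, uint32_t flag)
{
    assert(flag != 0);
    return (i_flags & flag) != 0;
}

/* File type part of i_mode (assumes the st_mode bit layout). */
int inode_is_dir(uint16_t i_mode)
{
    return S_ISDIR(i_mode) ? 1 : 0;
}

/* Access rights part of i_mode: the low 12 bits. */
unsigned inode_permissions(uint16_t i_mode)
{
    return i_mode & 07777u;
}
```

An inode with |i_blocks| = 8 therefore occupies 4096 bytes on disk, and
an |i_mode| of 040755 (octal) describes a directory with rights 0755.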