Kernel OS (Operating System) Linux - XWindow Windows (Win9x, WinME, WinNT, Win2K, WinXP) User Mode and Kernel Mode 680000 Operative modes 1. Kernel Mode: the machine operates with critical data structure, direct hardware (IN/OUT or memory mapped), direct memory, IRQ, DMA, and so on. 2. User Mode: users can run applications. | Applications /|\ | ______________ | | | User Mode | | | ______________ | | | | Implementation | _______ _______ | Abstraction Detail | | Kernel Mode | | | _______________ | | | | | | | | | | \|/ Hardware | Kernel Mode "prevents" User Mode applications from damaging the system or its features. Modern microprocessors implement in hardware at least 2 different states. For example under Intel, 4 states determine the PL (Privilege Level). It is possible to use 0,1,2,3 states, with 0 used in Kernel Mode. Unix OS requires only 2 privilege levels, and we will use such a paradigm as point of reference. Switching from User Mode to Kernel Mode 1. System Call: code living in Kernel Mode 2. IRQ (or exception) : the IRQ an IRQ handler (or exception handler) is called System Calls System calls are like special functions that manage OS routines which live in Kernel Mode. A system call can be called when we: * access an I/O device or a file (like read or write) * need to access privileged information (like pid, changing scheduling policy or other information) * need to change execution context (like forking or executing some other application) * need to execute a particular command (like ''chdir'', ''kill", ''brk'', or ''signal'') | | ------->| System Call i | (Accessing Devices) | | | | [sys_read()] | | ... | | | | | system_call(i) |-------- | | | [read()] | | | | ... | | | | system_call(j) |-------- | | | [get_pid()] | | | | | ... | ------->| System Call j | (Accessing kernel data structures) | | | [sys_getpid()]| | | USER MODE KERNEL MODE Unix System Calls Working System calls are almost the only interface used by User Mode to talk with low level resources (hardware). The only exception to this statement is when a process uses ''ioperm'' system call. In this case a device can be accessed directly by User Mode process (IRQs cannot be used). System Calls under Linux Kernel 2.4.17, from [ arch/i386/kernel/entry.S ] .long SYMBOL_NAME(sys_ni_syscall) /* 0 - old "setup()" system call*/ .long SYMBOL_NAME(sys_exit) .long SYMBOL_NAME(sys_fork) .long SYMBOL_NAME(sys_read) .long SYMBOL_NAME(sys_write) .long SYMBOL_NAME(sys_open) /* 5 */ .long SYMBOL_NAME(sys_close) .long SYMBOL_NAME(sys_waitpid) .long SYMBOL_NAME(sys_creat) .long SYMBOL_NAME(sys_link) .long SYMBOL_NAME(sys_unlink) /* 10 */ .long SYMBOL_NAME(sys_execve) .long SYMBOL_NAME(sys_chdir) .long SYMBOL_NAME(sys_time) .long SYMBOL_NAME(sys_mknod) .long SYMBOL_NAME(sys_chmod) /* 15 */ .long SYMBOL_NAME(sys_lchown16) .long SYMBOL_NAME(sys_ni_syscall) /* old break syscall holder */ .long SYMBOL_NAME(sys_stat) .long SYMBOL_NAME(sys_lseek) .long SYMBOL_NAME(sys_getpid) /* 20 */ .long SYMBOL_NAME(sys_mount) .long SYMBOL_NAME(sys_oldumount) .long SYMBOL_NAME(sys_setuid16) .long SYMBOL_NAME(sys_getuid16) .long SYMBOL_NAME(sys_stime) /* 25 */ .long SYMBOL_NAME(sys_ptrace) .long SYMBOL_NAME(sys_alarm) .long SYMBOL_NAME(sys_fstat) .long SYMBOL_NAME(sys_pause) .long SYMBOL_NAME(sys_utime) /* 30 */ .long SYMBOL_NAME(sys_ni_syscall) /* old stty syscall holder */ .long SYMBOL_NAME(sys_ni_syscall) /* old gtty syscall holder */ .long SYMBOL_NAME(sys_access) .long SYMBOL_NAME(sys_nice) .long SYMBOL_NAME(sys_ni_syscall) /* 35 */ /* old ftime syscall holder */ .long SYMBOL_NAME(sys_sync) .long SYMBOL_NAME(sys_kill) .long SYMBOL_NAME(sys_rename) .long SYMBOL_NAME(sys_mkdir) .long SYMBOL_NAME(sys_rmdir) /* 40 */ .long SYMBOL_NAME(sys_dup) .long SYMBOL_NAME(sys_pipe) .long SYMBOL_NAME(sys_times) .long SYMBOL_NAME(sys_ni_syscall) /* old prof syscall holder */ .long SYMBOL_NAME(sys_brk) /* 45 */ .long SYMBOL_NAME(sys_setgid16) .long SYMBOL_NAME(sys_getgid16) .long SYMBOL_NAME(sys_signal) .long SYMBOL_NAME(sys_geteuid16) .long SYMBOL_NAME(sys_getegid16) /* 50 */ .long SYMBOL_NAME(sys_acct) .long SYMBOL_NAME(sys_umount) /* recycled never used phys() */ .long SYMBOL_NAME(sys_ni_syscall) /* old lock syscall holder */ .long SYMBOL_NAME(sys_ioctl) .long SYMBOL_NAME(sys_fcntl) /* 55 */ .long SYMBOL_NAME(sys_ni_syscall) /* old mpx syscall holder */ .long SYMBOL_NAME(sys_setpgid) .long SYMBOL_NAME(sys_ni_syscall) /* old ulimit syscall holder */ .long SYMBOL_NAME(sys_olduname) .long SYMBOL_NAME(sys_umask) /* 60 */ .long SYMBOL_NAME(sys_chroot) .long SYMBOL_NAME(sys_ustat) .long SYMBOL_NAME(sys_dup2) .long SYMBOL_NAME(sys_getppid) .long SYMBOL_NAME(sys_getpgrp) /* 65 */ .long SYMBOL_NAME(sys_setsid) .long SYMBOL_NAME(sys_sigaction) .long SYMBOL_NAME(sys_sgetmask) .long SYMBOL_NAME(sys_ssetmask) .long SYMBOL_NAME(sys_setreuid16) /* 70 */ .long SYMBOL_NAME(sys_setregid16) .long SYMBOL_NAME(sys_sigsuspend) .long SYMBOL_NAME(sys_sigpending) .long SYMBOL_NAME(sys_sethostname) .long SYMBOL_NAME(sys_setrlimit) /* 75 */ .long SYMBOL_NAME(sys_old_getrlimit) .long SYMBOL_NAME(sys_getrusage) .long SYMBOL_NAME(sys_gettimeofday) .long SYMBOL_NAME(sys_settimeofday) .long SYMBOL_NAME(sys_getgroups16) /* 80 */ .long SYMBOL_NAME(sys_setgroups16) .long SYMBOL_NAME(old_select) .long SYMBOL_NAME(sys_symlink) .long SYMBOL_NAME(sys_lstat) .long SYMBOL_NAME(sys_readlink) /* 85 */ .long SYMBOL_NAME(sys_uselib) .long SYMBOL_NAME(sys_swapon) .long SYMBOL_NAME(sys_reboot) .long SYMBOL_NAME(old_readdir) .long SYMBOL_NAME(old_mmap) /* 90 */ .long SYMBOL_NAME(sys_munmap) .long SYMBOL_NAME(sys_truncate) .long SYMBOL_NAME(sys_ftruncate) .long SYMBOL_NAME(sys_fchmod) .long SYMBOL_NAME(sys_fchown16) /* 95 */ .long SYMBOL_NAME(sys_getpriority) .long SYMBOL_NAME(sys_setpriority) .long SYMBOL_NAME(sys_ni_syscall) /* old profil syscall holder */ .long SYMBOL_NAME(sys_statfs) .long SYMBOL_NAME(sys_fstatfs) /* 100 */ .long SYMBOL_NAME(sys_ioperm) .long SYMBOL_NAME(sys_socketcall) .long SYMBOL_NAME(sys_syslog) .long SYMBOL_NAME(sys_setitimer) .long SYMBOL_NAME(sys_getitimer) /* 105 */ .long SYMBOL_NAME(sys_newstat) .long SYMBOL_NAME(sys_newlstat) .long SYMBOL_NAME(sys_newfstat) .long SYMBOL_NAME(sys_uname) .long SYMBOL_NAME(sys_iopl) /* 110 */ .long SYMBOL_NAME(sys_vhangup) .long SYMBOL_NAME(sys_ni_syscall) /* old "idle" system call */ .long SYMBOL_NAME(sys_vm86old) .long SYMBOL_NAME(sys_wait4) .long SYMBOL_NAME(sys_swapoff) /* 115 */ .long SYMBOL_NAME(sys_sysinfo) .long SYMBOL_NAME(sys_ipc) .long SYMBOL_NAME(sys_fsync) .long SYMBOL_NAME(sys_sigreturn) .long SYMBOL_NAME(sys_clone) /* 120 */ .long SYMBOL_NAME(sys_setdomainname) .long SYMBOL_NAME(sys_newuname) .long SYMBOL_NAME(sys_modify_ldt) .long SYMBOL_NAME(sys_adjtimex) .long SYMBOL_NAME(sys_mprotect) /* 125 */ .long SYMBOL_NAME(sys_sigprocmask) .long SYMBOL_NAME(sys_create_module) .long SYMBOL_NAME(sys_init_module) .long SYMBOL_NAME(sys_delete_module) .long SYMBOL_NAME(sys_get_kernel_syms) /* 130 */ .long SYMBOL_NAME(sys_quotactl) .long SYMBOL_NAME(sys_getpgid) .long SYMBOL_NAME(sys_fchdir) .long SYMBOL_NAME(sys_bdflush) .long SYMBOL_NAME(sys_sysfs) /* 135 */ .long SYMBOL_NAME(sys_personality) .long SYMBOL_NAME(sys_ni_syscall) /* for afs_syscall */ .long SYMBOL_NAME(sys_setfsuid16) .long SYMBOL_NAME(sys_setfsgid16) .long SYMBOL_NAME(sys_llseek) /* 140 */ .long SYMBOL_NAME(sys_getdents) .long SYMBOL_NAME(sys_select) .long SYMBOL_NAME(sys_flock) .long SYMBOL_NAME(sys_msync) .long SYMBOL_NAME(sys_readv) /* 145 */ .long SYMBOL_NAME(sys_writev) .long SYMBOL_NAME(sys_getsid) .long SYMBOL_NAME(sys_fdatasync) .long SYMBOL_NAME(sys_sysctl) .long SYMBOL_NAME(sys_mlock) /* 150 */ .long SYMBOL_NAME(sys_munlock) .long SYMBOL_NAME(sys_mlockall) .long SYMBOL_NAME(sys_munlockall) .long SYMBOL_NAME(sys_sched_setparam) .long SYMBOL_NAME(sys_sched_getparam) /* 155 */ .long SYMBOL_NAME(sys_sched_setscheduler) .long SYMBOL_NAME(sys_sched_getscheduler) .long SYMBOL_NAME(sys_sched_yield) .long SYMBOL_NAME(sys_sched_get_priority_max) .long SYMBOL_NAME(sys_sched_get_priority_min) /* 160 */ .long SYMBOL_NAME(sys_sched_rr_get_interval) .long SYMBOL_NAME(sys_nanosleep) .long SYMBOL_NAME(sys_mremap) .long SYMBOL_NAME(sys_setresuid16) .long SYMBOL_NAME(sys_getresuid16) /* 165 */ .long SYMBOL_NAME(sys_vm86) .long SYMBOL_NAME(sys_query_module) .long SYMBOL_NAME(sys_poll) .long SYMBOL_NAME(sys_nfsservctl) .long SYMBOL_NAME(sys_setresgid16) /* 170 */ .long SYMBOL_NAME(sys_getresgid16) .long SYMBOL_NAME(sys_prctl) .long SYMBOL_NAME(sys_rt_sigreturn) .long SYMBOL_NAME(sys_rt_sigaction) .long SYMBOL_NAME(sys_rt_sigprocmask) /* 175 */ .long SYMBOL_NAME(sys_rt_sigpending) .long SYMBOL_NAME(sys_rt_sigtimedwait) .long SYMBOL_NAME(sys_rt_sigqueueinfo) .long SYMBOL_NAME(sys_rt_sigsuspend) .long SYMBOL_NAME(sys_pread) /* 180 */ .long SYMBOL_NAME(sys_pwrite) .long SYMBOL_NAME(sys_chown16) .long SYMBOL_NAME(sys_getcwd) .long SYMBOL_NAME(sys_capget) .long SYMBOL_NAME(sys_capset) /* 185 */ .long SYMBOL_NAME(sys_sigaltstack) .long SYMBOL_NAME(sys_sendfile) .long SYMBOL_NAME(sys_ni_syscall) /* streams1 */ .long SYMBOL_NAME(sys_ni_syscall) /* streams2 */ .long SYMBOL_NAME(sys_vfork) /* 190 */ .long SYMBOL_NAME(sys_getrlimit) .long SYMBOL_NAME(sys_mmap2) .long SYMBOL_NAME(sys_truncate64) .long SYMBOL_NAME(sys_ftruncate64) .long SYMBOL_NAME(sys_stat64) /* 195 */ .long SYMBOL_NAME(sys_lstat64) .long SYMBOL_NAME(sys_fstat64) .long SYMBOL_NAME(sys_lchown) .long SYMBOL_NAME(sys_getuid) .long SYMBOL_NAME(sys_getgid) /* 200 */ .long SYMBOL_NAME(sys_geteuid) .long SYMBOL_NAME(sys_getegid) .long SYMBOL_NAME(sys_setreuid) .long SYMBOL_NAME(sys_setregid) .long SYMBOL_NAME(sys_getgroups) /* 205 */ .long SYMBOL_NAME(sys_setgroups) .long SYMBOL_NAME(sys_fchown) .long SYMBOL_NAME(sys_setresuid) .long SYMBOL_NAME(sys_getresuid) .long SYMBOL_NAME(sys_setresgid) /* 210 */ .long SYMBOL_NAME(sys_getresgid) .long SYMBOL_NAME(sys_chown) .long SYMBOL_NAME(sys_setuid) .long SYMBOL_NAME(sys_setgid) .long SYMBOL_NAME(sys_setfsuid) /* 215 */ .long SYMBOL_NAME(sys_setfsgid) .long SYMBOL_NAME(sys_pivot_root) .long SYMBOL_NAME(sys_mincore) .long SYMBOL_NAME(sys_madvise) .long SYMBOL_NAME(sys_getdents64) /* 220 */ .long SYMBOL_NAME(sys_fcntl64) .long SYMBOL_NAME(sys_ni_syscall) /* reserved for TUX */ .long SYMBOL_NAME(sys_ni_syscall) /* Reserved for Security */ .long SYMBOL_NAME(sys_gettid) .long SYMBOL_NAME(sys_readahead) /* 225 */ IRQ Event When an IRQ comes, the task that is running is interrupted in order to service the IRQ Handler. After the IRQ is handled, control returns backs exactly to point of interrupt, like nothing happened. Running Task |-----------| (3) NORMAL | | | [break execution] IRQ Handler EXECUTION (1)| | | ------------->|---------| | \|/ | | | does | IRQ (2)---->| .. |-----> | some | | | |<----- | work | BACK TO | | | | | ..(4). | NORMAL (6)| \|/ | <-------------|_________| EXECUTION |___________| [return to code] (5) USER MODE KERNEL MODE User->Kernel Mode Transition caused by IRQ event The numbered steps below refer to the sequence of events in the diagram above: 1. Process is executing 2. IRQ comes while the task is running. 3. Task is interrupted to call an "Interrupt handler". 4. The "Interrupt handler" code is executed. 5. Control returns back to task user mode (as if nothing happened) 6. Process returns back to normal execution Special interest has the Timer IRQ, coming every TIMER ms to manage: 1. Alarms 2. System and task counters (used by schedule to decide when stop a process or for accounting) 3. Multitasking based on wake up mechanism after TIMESLICE time. 3.4 Multitasking Mechanism The key point of modern OSs is the "Task". The Task is an application running in memory sharing all resources (included CPU and Memory) with other Tasks. This "resource sharing" is managed by the "Multitasking Mechanism". The Multitasking Mechanism switches from one task to another after a "timeslice" time. Users have the "illusion" that they own all resources. We can also imagine a single user scenario, where a user can have the "illusion" of running many tasks at the same time. To implement this multitasking, the task uses "the state" variable, which can be: 1. READY, ready for execution 2. BLOCKED, waiting for a resource The task state is managed by its presence in a relative list: READY list and BLOCKED list. Task Switching The movement from one task to another is called ''Task Switching''. many computers have a hardware instruction which automatically performs this operation. Task Switching occurs in the following cases: 1. After Timeslice ends: we need to schedule a "Ready for execution" task and give it access. 2. When a Task has to wait for a device: we need to schedule a new task and switch to it * * We schedule another task to prevent "Busy Form Waiting", which occurs when we are waiting for a device instead performing other work. Task Switching is managed by the "Schedule" entity. Timer | | IRQ | | Schedule | | | ________________________ |----->| Task 1 |<------------------>|(1)Chooses a Ready Task | | | | |(2)Task Switching | | |___________| |________________________| | | | /|\ | | | | | | | | | | | | | | | | |----->| Task 2 |<-------------------------------| | | | | | |___________| | . . . . . . . . . . . . . . . | | | | | | | | ------>| Task N |<-------------------------------- | | |___________| Task Switching based on TimeSlice A typical Timeslice for Linux is about 10 ms. | | | | Resource _____________________________ | Task 1 |----------->|(1) Enqueue Resource request | | | Access |(2) Mark Task as blocked | | | |(3) Choose a Ready Task | |___________| |(4) Task Switching | |_____________________________| | | | | | | | | | Task 2 |<------------------------- | | | | |___________| Task Switching based on Waiting for a Resource Microkernel vs Monolithic OS A Microkernel OS uses Tasks, not only for user mode processes, but also as a real kernel manager, like Floppy-Task, HDD-Task, Net-Task and so on. Some examples are Amoeba, and Mach. PROs and CONTROs of Microkernel OS PROS: * OS is simpler to maintain because each Task manages a single kind of operation. So if you want to modify networking, you modify Net-Task (ideally, if it is not needed a structural update). CONS: * Performances are worse than Monolithic OS, because you have to add 2*TASK_SWITCH times (the first to enter the specific Task, the second to go out from it). Networking In RX it: 1. Manages handshake with low levels devices (like ethernet card or modem) receiving "frames" from them. 2. Builds TCP/IP "packets" from "frames" (like Ethernet or PPP ones), 3. Convers ''packets'' in ''sockets'' passing them to the right application (using port number) or 4. Forwards packets to the right queue frames packets sockets NIC ---------> Kernel ----------> Application | packets --------------> Forward - RX - In TX stage it: 1. Converts sockets or 2. Queues datas into TCP/IP ''packets'' 3. Splits ''packets" into "frames" (like Ethernet or PPP ones) 4. Sends ''frames'' using HW drivers sockets packets frames Application ---------> Kernel ----------> NIC packets /|\ Forward ------------------- - TX - Virtual Memory Segmentation | Stack | | | | | \|/ | | Free | | /|\ | Segment <---> Process | | | | Heap | | Data uninitialized | | Data initialized | | Code | |____________________| Segment We can say that a segment is the logical entity of an application, or the image of the application in memory. Linux uses only 4 segments for either Kernel and all Processes. Problems of Segmentation ____________________ ----->| |-----> | IN | Segment A | OUT ____________________ | |____________________| | |____| | | | Segment B | | Segment B | | |____ | | |____________________| | |____________________| | | Segment C | | |____________________| ----->| Segment D |-----> IN |____________________| OUT Segmentation problem In the diagram above, we want to get exit processes A, and D and enter process B. As we can see there is enough space for B, but we cannot split it in 2 pieces, so we CANNOT load it (memory out). The reason this problem occurs is because pure segments are continuous areas (because they are logical areas) and cannot be split. Pagination ____________________ | Page 1 | |____________________| | Page 2 | |____________________| | .. | Segment <---> Process |____________________| | Page n | |____________________| | | |____________________| | | |____________________| Segment Pagination splits memory in "n" pieces, each one with a fixed length. A process may be loaded in one or more Pages. When memory is freed, all pages are freed (see Segmentation Problem, before). Pagination is also used for another important purpose, "Swapping". If a page is not present in physical memory then it generates an EXCEPTION, that will make the Kernel search for a new page in storage memory. This mechanism allow OS to load more applications than the ones allowed by physical memory only. Pagination Problem ____________________ Page X | Process Y | |____________________| | | | WASTE | | SPACE | |____________________| Pagination Problem In the diagram above, we can see what is wrong with the pagination policy: when a Process Y loads into Page X, ALL memory space of the Page is allocated, so the remaining space at the end of Page is wasted. Segmentation and Pagination How can we solve segmentation and pagination problems? Using either 2 policies. | .. | |____________________| ----->| Page 1 | | |____________________| | | .. | ____________________ | |____________________| | | |---->| Page 2 | | Segment X | ----| |____________________| | | | | .. | |____________________| | |____________________| | | .. | | |____________________| |---->| Page 3 | |____________________| | .. | Process X, identified by Segment X, is split in 3 pieces and each of one is loaded in a page. We do not have: 1. Segmentation problem: we allocate per Pages, so we also free Pages and we manage free space in an optimized way. 2. Pagination problem: only last page wastes space, but we can decide to use very small pages, for example 4096 bytes length (losing at maximum 4096*N_Tasks bytes) and manage hierarchical paging (using 2 or 3 levels of paging) | | | | | | Offset2 | Value | | | /|\| | Offset1 | |----- | | | /|\ | | | | | | | | | | \|/| | | | | ------>| | \|/ | | | | Base Paging Address ---->| | | | | ....... | | ....... | | | | | Hierarchical Paging Linux Startup We start the Linux kernel first from C code executed from ''startup_32:'' asm label: |startup_32: |start_kernel |lock_kernel |trap_init |init_IRQ |sched_init |softirq_init |time_init |console_init |#ifdef CONFIG_MODULES |init_modules |#endif |kmem_cache_init |sti |calibrate_delay |mem_init |kmem_cache_sizes_init |pgtable_cache_init |fork_init |proc_caches_init |vfs_caches_init |buffer_init |page_cache_init |signals_init |#ifdef CONFIG_PROC_FS |proc_root_init |#endif |#if defined(CONFIG_SYSVIPC) |ipc_init |#endif |check_bugs |smp_init |rest_init |kernel_thread |unlock_kernel |cpu_idle The last function ''rest_init'' does the following: 1. launches the kernel thread ''init'' 2. calls unlock_kernel 3. makes the kernel run cpu_idle routine, that will be the idle loop executing when nothing is scheduled In fact the start_kernel procedure never ends. It will execute cpu_idle routine endlessly. Follows ''init'' description, which is the first Kernel Thread: |init |lock_kernel |do_basic_setup |mtrr_init |sysctl_init |pci_init |sock_init |start_context_thread |do_init_calls |(*call())-> kswapd_init |prepare_namespace |free_initmem |unlock_kernel Linux Multitasking Task States A Linux Task can be one of the following states (according to [include/linux.h]): 1. TASK_RUNNING, it means that it is in the "Ready List" 2. TASK_INTERRUPTIBLE, task waiting for a signal or a resource (sleeping) 3. TASK_UNINTERRUPTIBLE, task waiting for a resource (sleeping), it is in same "Wait Queue" 4. TASK_ZOMBIE, task child without father 5. TASK_STOPPED, task being debugged Graphical Interaction ______________ CPU Available ______________ | | ----------------> | | | TASK_RUNNING | | Real Running | |______________| <---------------- |______________| CPU Busy | /|\ Waiting for | | Resource Resource | | Available \|/ | ______________________ | | | TASK_INTERRUPTIBLE / | | TASK-UNINTERRUPTIBLE | |______________________| Main Multitasking Flow Timeslice PIT 8253 Programming Each 10 ms (depending on HZ value) an IRQ0 comes, which helps us in a multitasking environment. This signal comes from PIC 8259 (in arch 386+) which is connected to PIT 8253 with a clock of 1.19318 MHz. _____ ______ ______ | CPU |<------| 8259 |------| 8253 | |_____| IRQ0 |______| |___/|\| |_____ CLK 1.193.180 MHz // From include/asm/param.h #ifndef HZ #define HZ 100 #endif // From include/asm/timex.h #define CLOCK_TICK_RATE 1193180 /* Underlying HZ */ // From include/linux/timex.h #define LATCH ((CLOCK_TICK_RATE + HZ/2) / HZ) /* For divider */ // From arch/i386/kernel/i8259.c outb_p(0x34,0x43); /* binary, mode 2, LSB/MSB, ch 0 */ outb_p(LATCH & 0xff , 0x40); /* LSB */ outb(LATCH >> 8 , 0x40); /* MSB */ So we program 8253 (PIT, Programmable Interval Timer) with LATCH = (1193180/HZ) = 11931.8 when HZ=100 (default). LATCH indicates the frequency divisor factor. LATCH = 11931.8 gives to 8253 (in output) a frequency of 1193180 / 11931.8 = 100 Hz, so period = 10ms So Timeslice = 1/HZ. With each Timeslice we temporarily interrupt current process execution (without task switching), and we do some housekeeping work, after which we'll return back to our previous process. Scheduler The scheduler is the piece of code that chooses what Task has to be executed at a given time. Any time you need to change running task, select a candidate. Below is the ''schedule [kernel/sched.c]'' function. |schedule |do_softirq // manages post-IRQ work |for each task |calculate counter |prepare_to__switch // does anything |switch_mm // change Memory context (change CR3 value) |switch_to (assembler) |SAVE ESP |RESTORE future_ESP |SAVE EIP |push future_EIP *** push parameter as we did a call |jmp __switch_to (it does some TSS work) |__switch_to() .. |ret *** ret from call using future_EIP in place of call address new_task Task Switching When does Task switching occur? Now we'll see how the Linux Kernel switchs from one task to another. Task Switching is needed in many cases, such as the following: * when TimeSlice ends, we need to give access to some other task * when a task decide to access a resource, it sleeps for it, so we have to choose another task * when a task waits for a pipe, we have to give access to other task, which would write to pipe Task Switching TASK SWITCHING TRICK #define switch_to(prev,next,last) do { \ asm volatile("pushl %%esi\n\t" \ "pushl %%edi\n\t" \ "pushl %%ebp\n\t" \ "movl %%esp,%0\n\t" /* save ESP */ \ "movl %3,%%esp\n\t" /* restore ESP */ \ "movl $1f,%1\n\t" /* save EIP */ \ "pushl %4\n\t" /* restore EIP */ \ "jmp __switch_to\n" \ "1:\t" \ "popl %%ebp\n\t" \ "popl %%edi\n\t" \ "popl %%esi\n\t" \ :"=m" (prev->thread.esp),"=m" (prev->thread.eip), \ "=b" (last) \ :"m" (next->thread.esp),"m" (next->thread.eip), \ "a" (prev), "d" (next), \ "b" (prev)); \ } while (0) Trick is here: 1. ''pushl %4'' which puts future_EIP into the stack 2. ''jmp __switch_to'' which execute ''__switch_to'' function, but in opposite of ''call'' we will return to valued pushed in point 1 (so new Task!) U S E R M O D E K E R N E L M O D E | | | | | | | | | | | | Timer | | | | | | | Normal | IRQ | | | | | | | Exec |------>|Timer_Int.| | | | | | | | | .. | | | | | | \|/ | |schedule()| | Task1 Ret| | | | | |_switch_to|<-- | Address | |__________| |__________| | | | | | | | |S | | Task1 Data/Stack Task1 Code | | |w | | | | T|i | | | | a|t | | | | | | | | s|c | | | | | | Timer | | k|h | | | | | Normal | IRQ | | |i | | | | | Exec |------>|Timer_Int.| |n | | | | | | | | .. | |g | | | | | \|/ | |schedule()| | | Task2 Ret| | | | | |_switch_to|<-- | Address | |__________| |__________| |__________| |__________| Task2 Data/Stack Task2 Code Kernel Code Kernel Data/Stack Fork Fork is used to create another task. We start from a Task Parent, and we copy many data structures to Task Child. | | | .. | Task Parent | | | | | | | fork |---------->| CREATE | | | /| NEW | |_________| / | TASK | / | | --- / | | --- / | .. | / | | Task Child / | | / | fork |<-/ | | |_________| Fork SysCall What is not copied New Task just created (''Task Child'') is almost equal to Parent (''Task Parent''), there are only few differences: 1. obviously PID 2. child ''fork()'' will return 0, while parent ''fork()'' will return PID of Task Child, to distinguish them each other in User Mode 3. All child data pages are marked ''READ + EXECUTE'', no "WRITE'' (while parent has WRITE right for its own pages) so, when a write request comes, a ''Page Fault'' exception is generated which will create a new independent page: this mechanism is called ''Copy on Write'' (see Cap.10 for more). Linux Memory Management Linux uses segmentation + pagination, which simplifies notation. Segments Linux uses only 4 segments: * 2 segments (code and data/stack) for KERNEL SPACE from [0xC000 0000] (3 GB) to [0xFFFF FFFF] (4 GB) * 2 segments (code and data/stack) for USER SPACE from [0] (0 GB) to [0xBFFF FFFF] (3 GB) __ 4 GB--->| | | | Kernel | | Kernel Space (Code + Data/Stack) | | __| 3 GB--->|----------------| __ | | | | | | 2 GB--->| | | | Tasks | | User Space (Code + Data/Stack) | | | 1 GB--->| | | | | | |________________| __| 0x00000000 Kernel/User Linear addresses Specific i386 implementation Again, Linux implements Pagination using 3 Levels of Paging, but in i386 architecture only 2 of them are really used: ------------------------------------------------------------------ L I N E A R A D D R E S S ------------------------------------------------------------------ \___/ \___/ \_____/ PD offset PF offset Frame offset [10 bits] [10 bits] [12 bits] | | | | | ----------- | | | | Value |----------|--------- | | | | |---------| /|\ | | | | | | | | | | | | | | | | | | Frame offset | | | | | | | \|/ | | | | | |---------|<------ | | | | | | | | | | | | | | | | x 4096 | | | | PF offset|_________|------- | | | | /|\ | | | PD offset |_________|----- | | | _________| /|\ | | | | | | | | | | | \|/ | | \|/ _____ | | | ------>|_________| PHYSICAL ADDRESS | | \|/ | | x 4096 | | | CR3 |-------->| | | | |_____| | ....... | | ....... | | | | | Page Directory Page File Linux i386 Paging Memory Mapping Linux manages Access Control with Pagination only, so different Tasks will have the same segment addresses, but different CR3 (register used to store Directory Page Address), pointing to different Page Entries. In User mode a task cannot overcome 3 GB limit (0 x C0 00 00 00), so only the first 768 page directory entries are meaningful (768*4MB = 3GB). When a Task goes in Kernel Mode (by System call or by IRQ) the other 256 pages directory entries become important, and they point to the same page files as all other Tasks (which are the same as the Kernel). Note that Kernel (and only kernel) Linear Space is equal to Kernel Physical Space, so: ________________ _____ |Other KernelData|___ | | | |----------------| | |__| | | Kernel |\ |____| Real Other | 3 GB --->|----------------| \ | Kernel Data | | |\ \ | | | __|_\_\____|__ Real | | Tasks | \ \ | Tasks | | __|___\_\__|__ Space | | | \ \ | | | | \ \|----------------| | | \ |Real KernelSpace| |________________| \|________________| Logical Addresses Physical Addresses Linear Kernel Space corresponds to Physical Kernel Space translated 3 GB down (in fact page tables are something like { "00000000", "00000001" }, so they operate no virtualization, they only report physical addresses they take from linear ones). Notice that you'll not have an "addresses conflict" between Kernel and User spaces because we can manage physical addresses with Page Tables. Swap Swap is managed by the kswapd daemon (kernel thread). Swapping is needed when we have to access a page that is not in physical memory. Linux uses ''kswapd'' kernel thread to carry out this purpose. When the Task receives a page fault exception we do the following: | Page Fault Exception | cause by all these conditions: | a-) User page | b-) Read or write access | c-) Page not present | | -----------> |do_page_fault |handle_mm_fault |pte_alloc |pte_alloc_one |__get_free_page = __get_free_pages |alloc_pages |alloc_pages_pgdat |__alloc_pages |wakeup_kswapd // We wake up kernel thread kswapd Page Fault ICA * Stack and Heap * Memory allocation FF.. | | <-- bottom of the stack /|\ | | | higher | | | | stack values | | | \|/ growing | | XX.. | | <-- top of the stack [Stack Pointer] | | | | | | 00.. |_________________| <-- end of stack [Stack Segment] Stack Stack is used by functions for: * global variables * local variables * return address For example, for a classical function: |int foo_function (parameter_1, parameter_2, ..., parameter_n) { |variable_1 declaration; |variable_2 declaration; .. |variable_n declaration; |// Body function |dynamic variable_1 declaration; |dynamic variable_2 declaration; .. |dynamic variable_n declaration; |// Code is inside Code Segment, not Data/Stack segment! |return (ret-type) value; // often it is inside some register, for i386 eax register is used. |} we have | | | 1. parameter_1 pushed | \ S | 2. parameter_2 pushed | | Before T | ................... | | the calling A | n. parameter_n pushed | / C | ** Return address ** | -- Calling K | 1. local variable_1 | \ | 2. local variable_2 | | After | ................. | | the calling | n. local variable_n | / | | ... ... Free ... ... stack | | H | n. dynamic variable_n | \ E | ................... | | Allocated by A | 2. dynamic variable_2 | | malloc & kmalloc P | 1. dynamic variable_1 | / |_______________________| Typical stack usage Note: variables order can be different depending on hardware architecture. Copy_on_write Copy_on_write is a mechanism used to reduce memory usage. It postpones memory allocation until the memory is really needed. For example, when a task executes the "fork()" system call (to create another task), we still use the same memory pages as the parent, in read only mode. When a task WRITES into the page, it causes an exception and the page is copied and marked "rw" (read, write). 1-) Page X is shared between Task Parent and Task Child Task Parent | | RO Access ______ | |---------->|Page X| |_________| |______| /|\ | Task Child | | | RO Access | | |---------------- |_________| 2-) Write request Task Parent | | RO Access ______ | |---------->|Page X| Trying to write |_________| |______| /|\ | Task Child | | | RO Access | | |---------------- |_________| 3-) Final Configuration: Either Task Parent and Task Child have an independent copy of the Page, X and Y Task Parent | | RW Access ______ | |---------->|Page X| |_________| |______| Task Child | | RW Access ______ | |---------->|Page Y| |_________| |______| IRQ IRQ is an asyncronous signal sent to microprocessor to advertise a requested work is completed |<--> IRQ(0) [Timer] |<--> IRQ(1) [Device 1] | .. |<--> IRQ(n) [Device n] _____________________________| /|\ /|\ /|\ | | | \|/ \|/ \|/ Task(1) Task(2) .. Task(N) IRQ - Tasks Interaction Schema 1. IRQ (i) occurs and Task(j) is interrupted 2. IRQ(i)_handler is executed 3. control backs to Task(j) interrupted Fájlrendszerek Ext2 Superblock ext2fs superblock [include/linux/ext2_fs.h]: struct ext2_super_block { unsigned long s_inodes_count; unsigned long s_blocks_count; unsigned long s_r_blocks_count; unsigned long s_free_blocks_count; unsigned long s_free_inodes_count; unsigned long s_first_data_block; unsigned long s_log_block_size; long s_log_frag_size; unsigned long s_blocks_per_group; unsigned long s_frags_per_group; unsigned long s_inodes_per_group; unsigned long s_mtime; unsigned long s_wtime; unsigned short s_mnt_count; short s_max_mnt_count; unsigned short s_magic; unsigned short s_state; unsigned short s_errors; unsigned short s_pad; unsigned long s_lastcheck; unsigned long s_checkinterval; unsigned long s_reserved[238]; }; |s_inodes_count| the total number of inodes on the fs. |s_blocks_count| the total number of blocks on the fs. |s_r_blocks_count| the total number of blocks reserved for the exclusive use of the superuser. |s_free_blocks_count| the total number of free blocks on the fs. |s_free_inodes_count| the total number of free inodes on the fs. |s_first_data_block| the position on the fs of the first data block. Usually, this is block number 1 for fs containing 1024 bytes blocks and is number 0 for other fs. |s_log_block_size| used to compute the logical block size in bytes. The logical block size is in fact |1024 << s_log_block_size|. |s_log_frag_size| used to compute the logical fragment size. The logical fragment size is in fact |1024 << s_log_frag_size| if |s_log_frag_size| is positive and |1024 >> -s_log_frag_size| if |s_log_frag_size| is negative. |s_blocks_per_group| the total number of blocks contained in a group. |s_frags_per_group| the total number of fragments contained in a group. |s_inodes_per_group| the total number of inodes contained in a group. |s_mtime| the time at which the last mount of the fs was performed. |s_wtime| the time at which the last write of the superblock on the fs was performed. |s_mnt_count| the number of time the fs has been mounted in read-write mode without having been checked. |s_max_mnt_count| the maximum number of time the fs may be mounted in read-write mode before a check must be done. |s_magic| a magic number that permits the identification of the file system. It is |0xEF53| for a normal ext2fs and |0xEF51| for versions of ext2fs prior to 0.2b. |s_state| the state of the file system. It contains an or'ed value of EXT2_VALID_FS (0x0001) which means: unmounted cleanly; and EXT2_ERROR_FS (0x0002) which means: errors detected by the kernel code. |s_errors| indicates what operation to perform when an error occurs. See section Error Handling |s_pad| unused. |s_lastcheck| the time of the last check performed on the fs. |s_checkinterval| the maximum possible time between checks on the fs. || Inodes An inode uniquely describes a file |i_mode| type of file (character, block, link, etc.) and access rights on the file. |i_uid| uid of the owner of the file. |i_size| logical size in bytes. |i_atime| last time the file was accessed. |i_ctime| last time the inode information of the file was changed. |i_mtime| last time the file content was modified. |i_dtime| when this file was deleted. |i_gid| gid of the file. |i_links_count| number of links pointing to this file. |i_blocks| number of blocks allocated to this file counted in 512 bytes units |i_version| version of the file (used by NFS). |i_file_acl| control access list of the file (not used yet). |i_dir_acl| control access list of the directory (not used yet). |i_faddr| block where the fragment of the file resides. |i_frag| number of the fragment in the block. |i_size| size of the fragment. |i_flags| flags: |EXT2_SECRM_FL 0x0001| secure deletion |EXT2_UNRM_FL 0x0002| undelete |EXT2_COMPR_FL 0x0004| compress file |EXT2_SYNC_FL 0x0008| synchronous updates. || |EXT2_BAD_INO 1| a file containing the list of bad blocks on the file system. | | |i_block| pointers to blocks