[번역] scheduler/sched-design-CFS.txt

iamyooon 2018. 8. 3. 17:33

2018. 8. 3. 17:33

CFS Scheduler

1. 개요(Overview)

CFS는 "완벽하게 공정한 스케쥴러"를 의미하는 Ingo Molnar에 의해 구현된 새로운 "데스크탑" 프로세스 스케줄러다. Linux 2.6.23에 통합되었으며 이전 바닐라 스케쥴러의 SCHED_OTHER c코드를 대체한다.

CFS stands for "Completely Fair Scheduler," and is the new "desktop" process scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the replacement for the previous vanilla scheduler's SCHED_OTHER interactivity code.

CFS 디자인의 80%는 한줄로 요약할 수 있다:

80% of CFS's design can be summed up in a single sentence:

CFS는 실제 하드웨어 위에서 "이상적이고, 정밀한 멀티태스킹 CPU"를 모델링한다.
CFS basically models an "ideal, precise multi-tasking CPU" on real hardware.

100%의 physical power를 가지고 있는 "(존재하지 않지만) 이상적인 멀티태스킹 CPU"는 각 태스크를 병렬적으로 정확히 1/nr_running의 공평한 속도로 실행하는 CPU이다. 예를 들어, 두개의 태스크가 실행중이라면, 태스크들은 각각 50%의 physical power로 병렬실행한다.

"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical power and which can run each task at precise equal speed, in parallel, each at 1/nr_running speed. For example: if there are 2 tasks running, then it runs each at 50% physical power --- i.e., actually in parallel.

실제 하드웨어에서는 한 순간에 하나의 태스크만 실행할 수 있기 때문에 virtual runtime이라는 개념을 도입했다. 태스크의 virtual runtime은 위에서 설명한 이상적인 멀티태스킹 CPU에서 태스크가 실행을 시작하기 위한 다음 timeslice가 언제인지를 나타낸다. 실제로 태스크의 virtual runtime은 전체 실행중인 태스크 개수로 normalize한 진짜 실행시간이다.

On real hardware, we can run only a single task at once, so we have to introduce the concept of "virtual runtime." The virtual runtime of a task specifies when its next timeslice would start execution on the ideal multi-tasking CPU described above. In practice, the virtual runtime of a task is its actual runtime normalized to the total number of running tasks.

2. 몇가지 자세한 구현사항(FEW IMPLEMENTATION DETAILS)

CFS에서 virtual runtime은 각 태스크의 p->se.vruntime(나노초 단위)으로 표현되고 추적된다. 이 방식을 사용하면 태스크의 정확히 타임스탬프를 추적할 수 있고 태스크가 가져야 할 "예상 CPU 시간"을 측정할 수 있다.

In CFS the virtual runtime is expressed and tracked via the per-task p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately timestamp and measure the "expected CPU time" a task should have gotten.

[ small detail: 이상적인 하드웨어에서, 모든 태스크는 늘 동일한 p->se.vruntime 필드를 가진다. 예를 들어 태스크들은 동시에 실행되고 모든 태스크는 CPU시간를 공평하게 분배받는다.]

[ small detail: on "ideal" hardware, at any time all tasks would have the same p->se.vruntime value --- i.e., tasks would execute simultaneously and no task would ever get "out of balance" from the "ideal" share of CPU time. ]

CFS의 태스크 선택 로직은 p->se.vruntime 필드에 기반한다. 매우 간단하다. CFS는 항상 가장 작은 ->se.vruntime 필드를 가진 태스크(제일 적게 실행된 태스크)를 실행하려고 시도한다. 또한 가능한 한 이상적인 멀티태스킹 하드웨어에 가깝게 CPU 시간을 나누려고 한다.

CFS's task picking logic is based on this p->se.vruntime value and it is thus very simple: it always tries to run the task with the smallest p->se.vruntime value (i.e., the task which executed least so far). CFS always tries to split up CPU time between runnable tasks as close to "ideal multitasking hardware" as possible.

대부분의 남은 CFS의 디자인은 nice 레벨, 멀티프로세싱, sleeper를 알아내기 위한 다양한 변형 알고리즘과 같은 몇가지 추가 장치(add-on embellishment)과 함께 이 단순한 개념에 집중한다.

Most of the rest of CFS's design just falls out of this really simple concept, with a few add-on embellishments like nice levels, multiprocessing and various algorithm variants to recognize sleepers.

3. 레드블랙트리(THE RBTREE)

CFS의 디자인은 (과거의 것을 많이 사용하지 않는다는 의미에서)꽤 급진적이다. 런큐에 대한 오래된 자료구조를 사용하지 않고 다음 태스크 실행을 위한 타임라인을 만들기 위해 시간 순으로 정렬된 레드블랙 트리를 사용한다. 그리고 배열을 교체하는 구조가 없다.(이런 변화로 인해 기존의 바닐라스케쥴러와 RSDL/SD가 영향을 받는다)

CFS's design is quite radical: it does not use the old data structures for the runqueues, but it uses a time-ordered rbtree to build a "timeline" of future task execution, and thus has no "array switch" artifacts (by which both the previous vanilla scheduler and RSDL/SD are affected).

CFS는 또한 런큐 안의 모든 태스크 중에서 제일 작은 vruntime값을 추적하고 있는 rq->cfs.min_vruntime 필드를 유지한다. 이 값은 단방향으로 증가되는 값이다. 시스템에 의해 처리된 일의 총량은 min_vruntime에 의해 추적된다. 이 값은 새롭게 활성화된 entity를 가능한 한 레드블랙 트리의 왼쪽에 위치시키는데 사용된다.

CFS also maintains the rq->cfs.min_vruntime value, which is a monotonic increasing value tracking the smallest vruntime among all tasks in the runqueue. The total amount of work done by the system is tracked using min_vruntime; that value is used to place newly activated entities on the left side of the tree as much as possible.

런큐안에 동작중인 혹은 동작가능한 태스크의 개수는 rq->cfs.load 필드를 통해 집계되고 있다. 이 값은 런큐에 enqueue된 모든 태스크의 전체 weight이다.

The total number of running tasks in the runqueue is accounted through the rq->cfs.load value, which is the sum of the weights of the tasks queued on the runqueue.

CFS는 모든 태스크들이 p->se.vruntime 필드에 의해 정렬된 모든 태스크들이 있는 레드블랙 트리를 유지한다. CFS는 이 트리의 가장 왼쪽의 태스크를 선택하고 집중한다. 시스템이 계속 동작할 수록 실행된 태스크는 점점 트리의 오른쪽으로 이동할 것이다. 천천히 하지만 반드시 모든 태스크가 가장 왼쪽의 태스크가 되도록 기회를 준다. 따라서 정해진 시간안에 CPU를 얻을 수 있다.

CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the p->se.vruntime key. CFS picks the "leftmost" task from this tree and sticks to it. As the system progresses forwards, the executed tasks are put into the tree more and more to the right --- slowly but surely giving a chance for every task to become the "leftmost task" and thus get on the CPU within a deterministic amount of time.

요약하자면, CFS는 다음과 같이 동작한다. 태스크가 잠시 실행된 후 스케쥴될 때(혹은 스케쥴러 틱이 발생할 때) 태스크의 CPU사용량은 집계된다. 태스크가 실제 CPU를 사용해서 소모한 (작은)시간은 p->se.vruntime에 추가된다.

Summing up, CFS works like this: it runs a task a bit, and when the task schedules (or a scheduler tick happens) the task's CPU usage is "accounted for": the (small) time it just spent using the physical CPU is added to p->se.vruntime.

p->se.vruntime이 충분히 커지면 다른 태스크는 CFS가 관리하는 레드블랙 트리(시간 순으로 정렬된)의 가장 왼쪽 태스크가 된다. (작은 양의 "granularity"를 더해줘서 leftmost 태스크가 CPU를 과하게 쓰지 않게 하고 cache trashing도 막는다.) 그런다음 새롭게 가장왼쪽에 오게 된 태스크에 의해 현재 태스크가 선점된다.(leftmost task가 CPU를 점유한다)

Once p->se.vruntime gets high enough so that another task becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a small amount of "granularity" distance relative to the leftmost task so that we do not over-schedule tasks and trash the cache), then the new leftmost task is picked and the current task is preempted.

4. CFS의 몇가지 기능(SOME FEATURES OF CFS)

CFS는 나노초 단위의 통계를 사용하고 있으며 jiffies나 HZ 값에 의존적으로 동작하지 않는다. 따라서 예전 스케쥴러가 가지고 있던 timeslice라는 개념을 CFS 스케쥴러는 가지고 있지 않다. 그리고 그 어떤 휴리스틱도 사용하지 않는다. 그대신 단 하나의 central tunable을 가지고 있다(단 CONFIG_SCHED_DEBUG를 활성화해야 한다.)

CFS uses nanosecond granularity accounting and does not rely on any jiffies or other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the way the previous scheduler had, and has no heuristics whatsoever. There is only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):

/proc/sys/kernel/sched_min_granularity_ns

이 값은 스케쥴러를 데스크탑용(낮은지연시간)에서 서버용(good batching)으로 튜닝하는데 사용할 수 있다. 이 변수는 기본적으로 데스크탑 워크로드에 적합하게 설정되어 있다. SCHED_BATCH 태스크도 역시 CFS 스케줄러 모듈에 의해 처리된다.

which can be used to tune the scheduler from "desktop" (i.e., low latencies) to "server" (i.e., good batching) workloads. It defaults to a setting suitable for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too.

이런 디자인으로 인해, CFS 스케쥴러는 스케쥴러의 휴리스틱에 사용되는 현존하는 어떠한 공격에도 정상동작하고 응답성에 영향을 받지 않으며 예측된 행동을 보여준다.

Due to its design, the CFS scheduler is not prone to any of the "attacks" that exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all work fine and do not impact interactivity and produce the expected behavior.

CFS 스케쥴러는 과거의 바닐라 스케쥴러에 비해 nice level 처리와 SCHED_BATCH 태스크처리가 더 좋아졌다. 두 타입의 워크로드는 더욱 더 분리된다.

The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH than the previous vanilla scheduler: both types of workloads are isolated much more aggressively.

SMP 로드밸런싱 코드도 재작성되고 검증되었다. 기존의 런큐 탐색을 고려한 코드는 로드밸런싱에서 빠졌으며 현재는 스케줄링 모듈을 순회하는 방식이 사용되고 있다. 로드밸런싱 코드는 결과적으로 매우 단순해졌다.

SMP load-balancing has been reworked/sanitized: the runqueue-walking assumptions are gone from the load-balancing code now, and iterators of the scheduling modules are used. The balancing code got quite a bit simpler as a result.

5. 스케쥴링 정책(Scheduling policies)

CFS는 세가지 스케쥴링정책을 구현했다:

- SCHED_NORMAL(전통적으로 SCHED_OTHER로 불림)일반 태스크를 처리하는데 사용되는 정책이다.

- SCHED_NORMAL (traditionally called SCHED_OTHER)The scheduling policy that is used for regular tasks.

- SCHED_BATCH :

일반 태스크들처럼 자주 태스크가 선점되지 않고, 태스크가 비교적 오래 수행되며 캐시를 더욱 잘 활용하지만 응답성에 좋지 않다. 배치작업에 적합한 정책이다.

Does not preempt nearly as often as regular tasks would, thereby allowing tasks to run longer and make better use of caches but at the cost of interactivity. This is well suited for batch jobs.

- SCHED_IDLE : nice level +19인 태스크보다 우선순위가 떨어지는 정책이다. 우선순위가 역전되는 문제를 피하기 위한 idle timer scheduler는 아니다.

This is even weaker than nice 19, but its not a true idle timer scheduler in order to avoid to get into priority inversion problems which would deadlock the machine.

SCHED_FIFO/_RR 은 sched/rt.c에 구현되어 있고 POSIX에 정해진 것에 맞게 작성되었다. util-linux-ng 2.13.1.1 의 chrt를 사용하면 SCHED_IDLE을 제외한 모든 정책으로 프로세스를 설정할 수 있다

SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by POSIX. The command chrt from util-linux-ng 2.13.1.1 can set all of these except SCHED_IDLE.

6. 스케줄링 클래스(SCHEDULING CLASSES)

새로운 CFS 스케줄러는 스케줄러 모듈의 확장 가능한 계층구조인 스케줄링 클래스를 도입하는 방식으로 디자인되었다. 이 모듈들은 스케줄링 정책으로 추상화되며 스케줄러 코어에 의해 처리된다. 스케줄러 코어는 특정 스케줄링 정책에 대한 가정을 한 코드는 가지고 있지 않는다.

The new CFS scheduler has been designed in such a way to introduce "Scheduling Classes," an extensible hierarchy of scheduler modules. These modules encapsulate scheduling policy details and are handled by the scheduler core without the core code assuming too much about them.

sched/fair.c 는 위에서 기술한 CFS 스케줄러를 구현했다.

sched/fair.c implements the CFS scheduler described above.

sched/rt.c는 과거 바닐라 스케줄러가 했던 것에 비해 SCHED_FIFO와 SCHED_RR semantics를 간단한 방법으로 구현했다. 100개의 런큐를 사용하고(과거의 스케줄러와 달리 100개 모두 RT우선순위 레벨을 위해 쓰인다) expired 배열은 쓰지 않는다.

sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT priority levels, instead of 140 in the previous scheduler) and it needs no expired array.

스케줄링 클래스는 sched_class 구조체를 통해 구현되었다. 이 구조체는 관련된 이벤트가 발생했을 때 반드시 호출되어야 하는 함수에 대한 후킹 함수를 포함하고 있다.

Scheduling classes are implemented through the sched_class structure, which contains hooks to functions that must be called whenever an interesting event occurs.

아래는 후킹 함수의 일부이다.

This is the (partial) list of the hooks:

- enqueue_task(...)

태스크가 runnable 상태가 될 때 이 함수는 태스크의 스케쥴링 엔티티를 레드블랙트리에 넣기 위해 호출된다. 그리고 nr_running 변수를 증가시킨다.

Called when a task enters a runnable state. It puts the scheduling entity (task) into the red-black tree and increments the nr_running variable.

- dequeue_task(...)

태스크가 runnable 상태가 아닐때 이 함수는 해당 태스크를 레드블랙트리에서 빼기 위해 호출된다. 그리고 nr_running 변수를 감소시킨다.

When a task is no longer runnable, this function is called to keep the corresponding scheduling entity out of the red-black tree. It decrements the nr_running variable.

- yield_task(...)

이 함수는 compat_yield sysctl이 켜져있지 않다면 태스크를 레드블랙 트리에서 dequeue 한 다음 다시 enqueue한다. 이 경우에는 태스크를 레드블랙 트리의 가장 오른쪽으로 이동하게 된다.

This function is basically just a dequeue followed by an enqueue,unless the compat_yield sysctl is turned on; in that case, it places the scheduling entity at the right-most end of the red-black tree.

- check_preempt_curr(...)

이 함수는 runnable 상태에 진입한 태스크가 현재 동작중인 태스크를 선점해야 하는지를 체크한다.

This function checks if a task that entered the runnable state should preempt the currently running task.

- pick_next_task(...)

이 함수는 다음에 실행될 가장 적합한 태스크를 선택한다.

This function chooses the most appropriate task eligible to run next.

- set_curr_task(...)

이 함수는 태스크가 자신의 스케쥴링클래스를 변경하거나 자신이 속한 태스크 그룹을 변경할 때 호출된다.

This function is called when a task changes its scheduling class or changes its task group.

- task_tick(...)

이 함수는 대부분 타이머틱 함수에서 호출된다. 함수가 호출되면 실행하는 프로세스가 교체될 수도 있다. 이 함수는 선점을 수행한다.

This function is mostly called from time tick functions; it might lead to process switch. This drives the running preemption.

7. CFS에서의 그룹스케쥴링 확장(GROUP SCHEDULER EXTENSIONS TO CFS)

보통 스케줄러는 개별 태스크 단위와 동작한다. 그리고 각각의 태스크에게 공정하게 CPU 시간을 제공하기 위해 노력한다. 때때로 태스크들을 그룹으로 묶고 CPU 시간을 각 그룹에게 제공하는게 바람직 할 때도 있다. 예를 들어 CPU 시간을 시스템의 각 유저에게 공정하게 나눠주고 각각의 태스크는 유저에 속하는게 바람직할 수 있다.

Normally, the scheduler operates on individual tasks and strives to provide fair CPU time to each task. Sometimes, it may be desirable to group tasks and provide fair CPU time to each such task group. For example, it may be desirable to first provide fair CPU time to each user on the system and then to each task belonging to a user.

CONFIG_CGROUP_SCHED는 위 내용을 정확히 달성하기 위해 노력한다. 이 설정은 태스크들을 그룹으로 만들고 CPU 시간을 각 그룹간에 공평하게 나눠준다.

CONFIG_CGROUP_SCHED strives to achieve exactly that. It lets tasks to be grouped and divides CPU time fairly among such groups.

CONFIG_RT_GROUP_SCHED는 RT 태스크(SCHED_FIFO, SCHED_RR)그룹을 만들 수 있게 한다.

CONFIG_RT_GROUP_SCHED permits to group real-time (i.e., SCHED_FIFO and SCHED_RR) tasks.

CONFIG_FAIR_GROUP_SCHED는 CFS 태스크(SCHED_NORMAL, SCHED_BATCH)그룹을 만들 수 있게 한다.

CONFIG_FAIR_GROUP_SCHED permits to group CFS

(i.e., SCHED_NORMAL and SCHED_BATCH) tasks.

이런 커널옵션은 CONFIG_CGROUPS가 정의되어 있어야 사용할 수 있다. 관리자가 cgroup pseudo filesystem을 사용해서 태스크 그룹을 만들게 한다. 이 파일시스템에 대한 추가정보를 원한다면 Documentation/cgroups/cgroups.txt를 보자.

These options need CONFIG_CGROUPS to be defined, and let the administrator create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See Documentation/cgroups/cgroups.txt for more information about this filesystem.

CONFIG_FAIR_GROUP_SCHED가 정의되어 있다면, pseudo filesystem을 사용해서 만든 각 그룹에 cpu.shares 파일이 생성된다. 태스크그룹을 만들거나 CPU share를 변경하려면 아래 예제를 보자.

When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each group created using the pseudo filesystem. See example steps below to create task groups and modify their CPU share using the "cgroups" pseudo filesystem.

# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpu
# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
# cd /sys/fs/cgroup/cpu

multimedia, browser 태스크 그룹을 생성한다

# mkdir multimedia
# mkdir browser

multimedia 태스크 그룹이 browser 태스크 그룹보다 두배의 CPU 대역폭을 받도록 설정한다.

# echo 2048 > multimedia/cpu.shares
# echo 1024 > browser/cpu.shares

firefox를 실행한 뒤 browser 태스크 그룹에 넣는다.

# firefox &
# echo <firefox_pid> > browser/tasks

movie player를 실행한 뒤 multimedia 태스크 그룹에 넣는다.

# echo <movie_player_pid> > multimedia/tasks

'커널 번역(기사,문서)' 카테고리의 다른 글

[번역] vm/page_owner.txt (0)	2018.09.05
[번역] lwn - Atomic context and kernel API design (0)	2018.09.04
[번역] vm/slub.txt (0)	2018.08.30
[번역] scheduler/sched-nice-design.txt (0)	2018.08.03
[번역] scheduler/sched-stats.txt (0)	2018.08.03

kernelmsg