[Translation] scheduler/sched-rt-group.txt

iamyooon 2018. 9. 8. 13:05

Real-Time group scheduling

CONTENTS

0. WARNING

1. Overview

1.1 The problem

1.2 The solution

2. The interface

2.1 System-wide settings

2.2 Default behaviour

2.3 Basis for grouping tasks

3. Future plans

0. WARNING

Fiddling with these settings can result in an unstable system. The knobs are root-only, and the kernel assumes that root knows what it is doing.

Most notable:

* Very small values in sched_rt_period_us can result in an unstable system when the period is smaller than either the available hrtimer resolution or the time it takes to handle the budget refresh itself.

* Very small values in sched_rt_runtime_us can result in an unstable system when the runtime is so small that the system has difficulty making forward progress (NOTE: the migration thread and kstopmachine are both real-time processes).

1. Overview

1.1 The problem

Realtime scheduling is all about determinism: a group has to be able to rely on the amount of bandwidth (e.g. CPU time) being constant. In order to schedule multiple groups of realtime tasks, each group must be assigned a fixed portion of the available CPU time. Without a minimum guarantee, a realtime group can obviously fall short. A fuzzy upper limit is of no use since it cannot be relied upon. That leaves us with just the single fixed portion.

1.2 The solution

CPU time is divided by specifying how much time can be spent running in a given period. We allocate this "run time" to each realtime group, and the other realtime groups are not permitted to use it.

Any time not allocated to a realtime group is used to run normal-priority tasks (SCHED_OTHER). Any allocated run time that goes unused is also picked up by SCHED_OTHER.

Let's consider an example: a fixed-frame realtime renderer must deliver 25 frames a second, which yields a period of 0.04s per frame. Now say it also has to play some music and respond to input, leaving around 80% of the CPU time for the graphics. We can then give this group a run time of 0.8 * 0.04s = 0.032s.

This way the graphics group will have a 0.04s period with a 0.032s run time limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but needs only about 3% CPU time to do so, it can make do with a run time of 0.03 * 0.005s = 0.00015s. So this group can be scheduled with a period of 0.005s and a run time of 0.00015s.
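
The arithmetic in this example can be sketched in Python. The helper name runtime_us and the use of integer microseconds (the unit of the kernel knobs described later) are assumptions made for illustration; nothing here is part of the kernel interface.

```python
def runtime_us(period_us: int, cpu_percent: int) -> int:
    """Run time per period for a group that should get cpu_percent of the CPU."""
    if period_us <= 0:
        raise ValueError("period must be positive")
    if not 0 <= cpu_percent <= 100:
        raise ValueError("cpu_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding on microsecond values.
    return period_us * cpu_percent // 100


# Graphics: 25 frames per second gives a 0.04s period; 80% of the CPU.
graphics_period_us = 1_000_000 // 25
graphics_runtime_us = runtime_us(graphics_period_us, 80)

# Audio: DMA buffer refill every 0.005s; about 3% of the CPU.
audio_period_us = 5_000
audio_runtime_us = runtime_us(audio_period_us, 3)

print(graphics_period_us, graphics_runtime_us)  # 40000 32000
print(audio_period_us, audio_runtime_us)        # 5000 150
```

The results, 32000us and 150us, are the 0.032s and 0.00015s run times from the text.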

The remaining CPU time will be used for user input and other tasks. Because the realtime tasks have been explicitly allocated the CPU time they need to perform their work, buffer underruns in the graphics or audio can be eliminated.

NOTE: the above example is not fully implemented yet. We still lack an EDF scheduler to make non-uniform periods usable.

2. The interface

2.1 System-wide settings

The system-wide settings are configured under the /proc virtual file system:

/proc/sys/kernel/sched_rt_period_us:

The scheduling period that is equivalent to 100% CPU bandwidth.

/proc/sys/kernel/sched_rt_runtime_us:

A global limit on how much time realtime scheduling may use. Even without CONFIG_RT_GROUP_SCHED enabled, this limits the time reserved for realtime processes. With CONFIG_RT_GROUP_SCHED, it signifies the total bandwidth available to all realtime groups.

* Time is specified in us because the interface is s32. This gives an operating range from 1us to about 35 minutes.

* sched_rt_period_us takes values from 1 to INT_MAX.

* sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).

* A run time of -1 specifies runtime == period, i.e. no limit.
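
The ranges listed above can be written as a small Python check. This is a sketch that covers only the limits stated in this section; the function names are hypothetical, and the kernel may apply further checks that are not modelled here.

```python
INT_MAX = 2**31 - 1  # largest value of a signed 32-bit integer (s32)


def validate_rt_knobs(period_us: int, runtime_us: int) -> None:
    """Raise ValueError if a value is outside the ranges listed above."""
    if not 1 <= period_us <= INT_MAX:
        raise ValueError("sched_rt_period_us must be in 1..INT_MAX")
    if not -1 <= runtime_us <= INT_MAX - 1:
        raise ValueError("sched_rt_runtime_us must be in -1..INT_MAX-1")


def effective_runtime_us(period_us: int, runtime_us: int) -> int:
    """Run time available per period; -1 means runtime == period (no limit)."""
    validate_rt_knobs(period_us, runtime_us)
    return period_us if runtime_us == -1 else runtime_us


# INT_MAX microseconds expressed in minutes: the "about 35 minutes" bound.
max_minutes = INT_MAX / 1_000_000 / 60
print(round(max_minutes, 1))  # 35.8
```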

2.2 Default behaviour

The default values are 1000000 (1s) for sched_rt_period_us and 950000 (0.95s) for sched_rt_runtime_us. This leaves 0.05s per period to be used by SCHED_OTHER (non-RT tasks). These defaults were chosen so that a runaway realtime task will not lock up the machine but will leave a little time to recover it. Setting runtime to -1 restores the old behaviour, in which realtime scheduling has no limit.
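
As a quick check of these defaults, the share of each period left for non-RT tasks can be computed as follows. The function name non_rt_share is hypothetical; the two default values are the ones quoted above.

```python
def non_rt_share(period_us: int, runtime_us: int) -> float:
    """Fraction of each period that realtime scheduling may not use."""
    if runtime_us == -1:
        # -1 means runtime == period, so nothing is held back for non-RT tasks.
        return 0.0
    return (period_us - runtime_us) / period_us


DEFAULT_PERIOD_US = 1_000_000   # sched_rt_period_us default (1s)
DEFAULT_RUNTIME_US = 950_000    # sched_rt_runtime_us default (0.95s)

share = non_rt_share(DEFAULT_PERIOD_US, DEFAULT_RUNTIME_US)
print(f"{share:.2%}")  # 5.00%
```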

By default all bandwidth is assigned to the root group, and new groups get the period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you want to assign bandwidth to another group, reduce the root group's bandwidth and assign some or all of the difference to that group.

Realtime group scheduling means you have to assign a portion of the total CPU bandwidth to a group before it will accept realtime tasks. Therefore you will not be able to run realtime tasks as any user other than root until you have done that, even if the user has the rights to run processes with realtime priority!

2.3 Basis for grouping tasks

Enabling CONFIG_RT_GROUP_SCHED lets you explicitly allocate real CPU bandwidth to task groups. This uses the cgroup virtual file system and "cgroup/cpu.rt_runtime_us" to control the CPU time reserved for each control group. For more information on working with control groups, you should read Documentation/cgroups/cgroups.txt as well (Documentation/cgroup-v1/cgroups.txt in newer kernel trees).
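
A minimal sketch of setting a group's run time from Python is shown below. Only the file name cpu.rt_runtime_us comes from the text; the helper names are hypothetical, and the directory passed in is whatever path the group has under your cgroup mount point, which this document does not specify.

```python
import os


def set_group_rt_runtime(cgroup_dir: str, runtime_us: int) -> str:
    """Write runtime_us to cpu.rt_runtime_us in cgroup_dir; return the file path."""
    path = os.path.join(cgroup_dir, "cpu.rt_runtime_us")
    with open(path, "w") as f:
        f.write(f"{int(runtime_us)}\n")
    return path


def get_group_rt_runtime(cgroup_dir: str) -> int:
    """Read the current value of cpu.rt_runtime_us in cgroup_dir."""
    with open(os.path.join(cgroup_dir, "cpu.rt_runtime_us")) as f:
        return int(f.read().strip())
```

On a real system the write needs root and a kernel built with CONFIG_RT_GROUP_SCHED. A call such as set_group_rt_runtime("/sys/fs/cgroup/cpu/graphics", 32000) uses an assumed mount point and a made-up group name.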

Group settings are checked against the following limits in order to keep the configuration schedulable:

\Sum_{i} runtime_{i} / global_period <= global_runtime / global_period

For now, this can be simplified to just the following (but see Future plans):

\Sum_{i} runtime_{i} <= global_runtime
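
The simplified inequality can be illustrated in user space as follows. This only restates the formula above and is not the kernel's own check; the function name and the sample group run times are made up, and every group is assumed to use the global period.

```python
from typing import Iterable


def is_schedulable(group_runtimes_us: Iterable[int],
                   global_runtime_us: int,
                   global_period_us: int) -> bool:
    """Check Sum_i runtime_i <= global_runtime for groups sharing one period."""
    if global_runtime_us == -1:
        # -1 means runtime == period, so the whole period is available.
        global_runtime_us = global_period_us
    return sum(group_runtimes_us) <= global_runtime_us


# With the default global settings (period 1000000us, runtime 950000us):
print(is_schedulable([500_000, 300_000], 950_000, 1_000_000))  # True
print(is_schedulable([500_000, 500_000], 950_000, 1_000_000))  # False
```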

3. Future plans

There is work in progress to make the scheduling period for each group ("cgroup/cpu.rt_period_us") configurable as well.

The constraint on the period is that a subgroup must have a period smaller than or equal to that of its parent. Realistically, though, this is not very useful _yet_, as it is prone to starvation without deadline scheduling.

Consider two sibling groups A and B; both have 50% bandwidth, but A's period is twice the length of B's.

* group A: period=100000us, runtime=50000us

- this runs for 0.05s once every 0.1s

* group B: period= 50000us, runtime=25000us

- this runs for 0.025s twice every 0.1s (or once every 0.05 sec).

This means that currently a while (1) loop in A will run for the full period of B and can starve B's tasks (assuming they are of lower priority) for a whole period.
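
The starvation argument can be checked with a small calculation. It takes the 50% bandwidth figure stated for both groups and assumes a single CPU on which a higher-priority busy loop in A spends its whole run time in one uninterrupted burst. This is a simplified model with hypothetical names, not a simulation of the scheduler.

```python
def group_runtime_us(period_us: int, bandwidth_percent: int) -> int:
    """Run time per period for a group with the given share of the CPU."""
    return period_us * bandwidth_percent // 100


a_period_us = 100_000                                # group A: 0.1s period
b_period_us = 50_000                                 # group B: half of A's period
a_runtime_us = group_runtime_us(a_period_us, 50)     # 50% bandwidth
b_runtime_us = group_runtime_us(b_period_us, 50)     # 50% bandwidth

# Number of complete B periods that one burst of A's run time can cover.
b_periods_starved = a_runtime_us // b_period_us

print(a_runtime_us, b_runtime_us)  # 50000 25000
print(b_periods_starved)           # 1
```

Under these assumptions A's burst lasts exactly as long as one full period of B, so B can receive no CPU time during that period.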

The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring full deadline scheduling to the Linux kernel. Deadline scheduling the above groups and treating the end of the period as a deadline will ensure that they both get their allocated time.

Implementing SCHED_EDF might take a while to complete. Priority Inheritance is the biggest challenge, as the current Linux PI infrastructure is geared towards the limited static priority levels 0-99. With deadline scheduling you need to do deadline inheritance (since priority is inversely proportional to the deadline delta (deadline - now)).

This means the whole PI machinery will have to be reworked - and that is one of the most complex pieces of code we have.