20250716

1. How do conditionals work?

[Excerpt from RedN]

We insert a CAS that compares the 64-bit value at the address of R2’s opcode attribute (initially NOOP) with its old parameter (also initially NOOP).

1.1) How does CAS work?

CAS compares the 64-bit value in its compare data field with the value in the destination buffer. In the struct below, cmp_data is compared with the value held in cmp_or_add_dest_buff (the data buffer); if the two are equal, the destination value is replaced with swap_or_add_data.

(Source: https://docs.nvidia.com/doca/archive/doca-v2.2.0/rdma-programming-guide/index.html)

struct doca_rdma_job_atomic {
    struct doca_job base;                           /**< Common job data */
    struct doca_buf *cmp_or_add_dest_buff;          /**< Destination data buffer */
    struct doca_buf *result_buff;                   /**< Result of the atomic operation:
                                                     *  remote original data before add, or remote original data
                                                     *  before compare
                                                     */
    uint64_t swap_or_add_data;                      /**< For add, the increment value
                                                     *  for cmp, the new value to swap
                                                     */
    uint64_t cmp_data;                              /**< Value to compare for compare and swap */
    struct doca_rdma_addr const *rdma_peer_addr;    /**< Optional: For RDMA context of type DC */
};
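
The compare-then-swap rule described above can be sketched on a local 64-bit word. This is only an illustration: cas64 is a hypothetical helper, not a DOCA call, and it runs single-threaded on host memory rather than atomically on a remote buffer. Its return value plays the role of result_buff (the original value before the compare).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of DOCA): emulates one compare-and-swap
 * on a local word. Not atomic; it only shows the semantics. */
uint64_t cas64(uint64_t *dest, uint64_t cmp_data, uint64_t swap_data)
{
    uint64_t original = *dest;      /* value that result_buff would receive */

    if (original == cmp_data)       /* compare against the expected value */
        *dest = swap_data;          /* equal: install the new value */

    return original;                /* returned whether or not the swap happened */
}
```

A caller can tell whether the swap happened by checking that the returned value equals cmp_data.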

1.2) How RedN uses CAS

We insert a CAS that compares the 64-bit value at the address of R2’s opcode attribute (initially NOOP) with its old parameter (also initially NOOP). We then set the id field of R2 to x. This field can be manipulated freely without changing the behavior of the WR, allowing us to use it to store x. Operand y is stored in the corresponding position in the old field of R1. This means that if x and y are equal, the CAS operation will succeed and the value in R1’s new field—which we set to WRITE—will replace R2’s opcode. Hence, in the case x = y, R2 will change from a NOOP into a WRITE operation. This WRITE is set to modify the data value of the return operation (R3) to 1. If x and y are not equal, the default value 0 is returned.

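The paragraph describes the following sequence.

A CAS verb is inserted to implement RDMA conditional execution.
Initially, the CAS's old field is set to NOOP and its new field to WRITE.
- The CAS's expected value is R1's old parameter, and the address it compares against is the address of R2's opcode attribute, which initially holds the 64-bit value NOOP.
- The CAS compares the value at the address of R2's opcode attribute with R1's expected value.
Of the inputs x and y, x is stored in R2's id field and y is stored in the corresponding position of R1's expected value (the old field).
- R2's id field is distinct from its opcode, so writing x there does not change what the WR does. It still lies inside the 64-bit word that the CAS compares, which is what lets x take part in the comparison.
With x and y in place, R1's CAS compares the current value at R2's opcode address with the expected value carried by the CAS.
- If x == y, the CAS succeeds and R2's opcode changes from NOOP to WRITE. R2 then executes as a WRITE that sets R3's data value to 1, and R3 sends that value to the client.
- If x != y, R1's CAS leaves R2's opcode unchanged. R2 stays a NOOP and does nothing, so R3's data value is not modified and the initial value (0) is sent.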
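
The sequence above can be simulated on the host as a rough sketch. Everything here is an assumption made for illustration: the opcode values, the word layout (operand in the upper 56 bits, opcode in the low byte), and the names pack_ctrl and redn_if_equal are hypothetical and do not match the real mlx5 WQE format.

```c
#include <assert.h>
#include <stdint.h>

enum { OP_NOOP = 0x00, OP_WRITE = 0x01 };  /* illustrative values, not mlx5 opcodes */

/* Hypothetical layout: operand (id field) in the upper bits, opcode in the low byte. */
uint64_t pack_ctrl(uint64_t operand, uint8_t opcode)
{
    return (operand << 8) | opcode;
}

/* Simulates R1 (CAS), R2 (NOOP or WRITE) and R3 (return) for inputs x and y. */
uint64_t redn_if_equal(uint64_t x, uint64_t y)
{
    /* Operands must fit above the opcode byte in this made-up layout. */
    assert(x < (1ULL << 56) && y < (1ULL << 56));

    uint64_t r2_ctrl = pack_ctrl(x, OP_NOOP);   /* R2: NOOP carrying x in its id field */
    uint64_t r1_old  = pack_ctrl(y, OP_NOOP);   /* R1.old: NOOP carrying y */
    uint64_t r1_new  = pack_ctrl(y, OP_WRITE);  /* R1.new: same word, opcode WRITE */
    uint64_t r3_data = 0;                       /* R3 returns 0 by default */

    /* R1: CAS on R2's control word; it matches only when x == y. */
    if (r2_ctrl == r1_old)
        r2_ctrl = r1_new;

    /* R2: acts only if its opcode was rewritten to WRITE. */
    if ((r2_ctrl & 0xFF) == OP_WRITE)
        r3_data = 1;

    /* R3: send the result back to the client. */
    return r3_data;
}
```

Because x and y sit in the same bit positions of the two words, comparing the whole 64-bit word is equivalent to comparing x with y while the opcode is still NOOP.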

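* In the actual code...

swap_add corresponds to the new field (WRITE) and compare corresponds to the old field (NOOP). In the fragment below, the comments label the new opcode SEND.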

struct mlx5_wqe_atomic_seg {
	uint64_t	swap_add;
	uint64_t	compare;
};

sr2_ctrl->opmod_idx_opcode = sr2_ctrl->opmod_idx_opcode | 0x09000000; //SEND
sr1_atomic->swap_add =  htobe64(*((uint64_t *)&sr2_ctrl->opmod_idx_opcode)); //SEND

sr2_ctrl->opmod_idx_opcode = sr2_ctrl->opmod_idx_opcode & 0x00FFFFFF; //NOOP
sr1_atomic->compare = htobe64(*((uint64_t *)&sr2_ctrl->opmod_idx_opcode)); //NOOP
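
The two masks in the fragment touch only the top byte of the host-order 32-bit value. A minimal sketch of that bit manipulation follows; with_opcode and with_noop are hypothetical helpers, and the sketch makes no claim about which mlx5 opcode a given byte value denotes. Unlike the plain OR in the fragment, with_opcode clears the old byte first, so it does not rely on the byte already being zero.

```c
#include <assert.h>
#include <stdint.h>

#define OPCODE_MASK 0xFF000000u   /* top byte of the host-order word */

/* Hypothetical helper: replace the top byte with the given opcode value. */
uint32_t with_opcode(uint32_t word, uint8_t opcode)
{
    return (word & ~OPCODE_MASK) | ((uint32_t)opcode << 24);
}

/* Hypothetical helper: clear the top byte, same effect as `& 0x00FFFFFF`. */
uint32_t with_noop(uint32_t word)
{
    return word & ~OPCODE_MASK;
}
```

Capturing the word once with the opcode set and once with it cleared gives the two values that the fragment stores in swap_add and compare.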

2. How do loops work?

2.1) Bounded loops

The loop body uses a CAS verb to implement the if condition (line 3), followed by an ADD verb to increment i (line 6).
Given that the loop size is known a priori ( size = 2), RedN can unroll the while loop in advance and post the WRs for all iterations. As such, there is no need to check the condition at line 2. For each iteration, if the CAS succeeds, the NOOP verb in WQ1 will be changed to WRITE—which will send the response back to the client. However, it is clear that, regardless of the comparison result, all subsequent iterations will be executed. This is inefficient since, if the send (line 4) occurs before the loop is finished, a number of WRs will be wastefully executed by the NIC. This is impractical for larger loop sizes or if the number of iterations is not known a priori.

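In short, when the loop is static there is no need to build an actual loop: the WRs for every iteration can be posted in advance. For line 3, if (x == A[i]), a failed CAS leaves the WR in WQ1 as a NOOP, the ADD (FAA) verb runs, and execution moves on to the next iteration. A successful CAS changes the NOOP in WQ1 to a WRITE, which sends the response. The drawback is that even after the CAS succeeds, the remaining iterations still run, which is inefficient.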
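
The cost of unrolling can be seen in a small host-side simulation. This is a sketch under assumptions: unrolled_search and struct loop_result are hypothetical names, each iteration is modeled as exactly one CAS plus one ADD, and no real WRs are posted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct loop_result {
    int    found;         /* 1 if the response WR was turned into a WRITE */
    size_t wrs_executed;  /* CAS and ADD verbs the NIC would have run */
};

/* Simulates a loop whose iterations were all posted in advance. */
struct loop_result unrolled_search(const uint64_t *a, size_t size, uint64_t x)
{
    struct loop_result r = { 0, 0 };
    size_t i = 0;

    for (size_t iter = 0; iter < size; iter++) {  /* one pre-posted iteration each */
        if (a[i] == x)          /* CAS: on a match, response NOOP becomes WRITE */
            r.found = 1;
        r.wrs_executed++;

        i++;                    /* ADD: runs whether or not the CAS matched */
        r.wrs_executed++;
    }
    return r;
}
```

In this model the number of executed WRs is always 2 * size, whether the match happens in the first iteration or never.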

2.2) Unbounded loops

For efficiency, we add a break that exits the loop if the element is found. The role of break is to prevent additional iterations from being executed. We use an additional NOOP that is formatted such that, once transformed into a WRITE by the CAS operation, it prevents the execution of subsequent iterations in the loop. This is done by modifying the last WR in the loop such that it does not trigger a completion event. The next iteration in the loop, which WAITs on such an event (via completion ordering), will therefore not be executed. Moreover, the WRITE will also modify the opcode of the WR used to send back the response from NOOP to WRITE.
As such, break allows efficient and unbounded loop execution. However, it still remains necessary for the CPU to post WRs to continue the loop after all its WRs are executed. This consumes CPU cycles and can even increase latency if the CPU is unable to keep up with the speed of WR execution.

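Previously, even when the CAS in WQ2 succeeded and changed the NOOP in WQ1 to a WRITE, the FAA that follows the CAS in WQ2 still ran regardless of the CAS result, as did every later iteration, wasting NIC resources. An additional NOOP in WQ2 addresses this.

When the CAS in WQ2 succeeds, the NOOP in WQ2 is rewritten into a WRITE (the break), which prevents further iterations. It does so by modifying the last WR of the loop so that it does not trigger a completion event. Each iteration begins with a WAIT that, through completion ordering, is triggered by that event, so the next iteration's WAIT never fires and the next iteration does not run. The same WRITE also changes the opcode of the response WR in WQ1 from NOOP to WRITE, so the response is sent to the client.
When the CAS in WQ2 fails, the opcode is unchanged: the NOOP in WQ2 remains a NOOP and the next verb, ADD, runs. The NOOP in WQ1 likewise keeps its opcode.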
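
The break mechanism can be sketched as a host-side simulation in which a flag stands in for the completion event. The names search_with_break and struct break_result are hypothetical, and the model ignores the CPU having to re-post WRs; it only shows why later iterations stop once the CAS succeeds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct break_result {
    int    found;           /* 1 if the response WR was turned into a WRITE */
    size_t iterations_run;  /* iterations whose WAIT was satisfied */
};

/* Simulates a loop where each iteration WAITs on the previous completion. */
struct break_result search_with_break(const uint64_t *a, size_t size, uint64_t x)
{
    struct break_result r = { 0, 0 };
    int completion = 1;             /* assume the first WAIT is already satisfied */

    for (size_t i = 0; i < size; i++) {
        if (!completion)            /* WAIT: no completion event, iteration never starts */
            break;
        r.iterations_run++;

        if (a[i] == x) {            /* CAS succeeded: break NOOP becomes WRITE */
            r.found = 1;            /* the WRITE flips the response WR to WRITE */
            completion = 0;         /* the WRITE stops the last WR from signaling */
        }
        /* ADD: incrementing i is done by the for loop itself. */
    }
    return r;
}
```

For an array of four elements with the match at index 1, only two iterations run in this model, compared with all four in the unrolled version.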
