Insight
MetisX Co., Ltd.
Company Registration Number : 710-81-02837
Address : 20, Pangyoyeok-ro 241beon-gil, Bundang-gu, Seongnam-si, Gyeonggi-do, Republic of Korea
CEO : Jin Kim
© 2024 MetisX | All Rights Reserved
RAS stands for Reliability, Availability, and Serviceability, and describes the robustness of a computing system. For memory specifically, it refers to fault-tolerant hardware designs such as ECC (Error Correction Code), chipkill, and memory scrubbing.
Errors in the memory subsystem occur due to design failures, defects, or electrical noise in any one of its components. These errors are classified as either hard errors (caused by design failures) or soft errors (caused by system noise or by bit flips in the memory array due to alpha particles and similar events). RAS features allow the system to continue operating when errors are correctable, while logging the details of uncorrectable errors for later debugging.
One of the most common RAS schemes used in the memory subsystem is ECC. ECC-enabled memory stores extra parity bits alongside each data word and uses them to detect data corruption. A typical scheme detects errors of up to two bits per word but corrects only a single bit per word, which is known as single-error correction, double-error detection (SECDED). Applications using standard DDR memories (e.g. DDR4, DDR5) implement ECC with the side-band scheme, which works as follows:
Write:
① SoC sends data to memory controller.
② Controller generates parity and sends it with the data to memory.
③ Data and parity are stored on separate DRAMs.
Read:
① Memory controller reads data and parity stored on DRAMs.
② Controller checks parity on the data read.
③ If there is an error, the controller corrects the data before sending it to the SoC; if there is no error, the data is sent as is.
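The write and read flows above can be sketched in software. The example below is a minimal illustration, not the controller logic of any particular product: it assumes a 64-bit data word protected by 8 check bits (an extended Hamming code, one common way to get single-bit correction and double-bit detection), and the names `encode`, `decode`, and `Status` are hypothetical. The data word and the check bits are kept as separate values, mirroring the side-band layout in which they are stored on separate DRAMs.

```python
from enum import Enum

DATA_BITS = 64
# Hamming positions are numbered from 1. Powers of two are reserved for
# check bits, so the 64 data bits occupy the non-power-of-two positions.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1) != 0][:DATA_BITS]


class Status(Enum):
    OK = "no error"
    CORRECTED = "single-bit error corrected"
    UNCORRECTABLE = "uncorrectable error detected"


def _parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def _hamming_check(data: int) -> int:
    """XOR together the positions of all set data bits (7-bit result)."""
    check = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            check ^= pos
    return check


def encode(data: int) -> int:
    """Write path: derive the 8 check bits stored next to the data.

    Bits 0-6 are the Hamming check bits, bit 7 is an overall parity bit.
    """
    check = _hamming_check(data)
    overall = _parity(data) ^ _parity(check)
    return (overall << 7) | check


def decode(data: int, ecc: int) -> tuple[int, Status]:
    """Read path: verify data against its check bits, correcting if possible."""
    stored_check = ecc & 0x7F
    stored_overall = ecc >> 7
    syndrome = _hamming_check(data) ^ stored_check
    parity_mismatch = _parity(data) ^ _parity(stored_check) ^ stored_overall

    if syndrome == 0 and not parity_mismatch:
        return data, Status.OK
    if parity_mismatch:
        # An odd number of bits flipped; treat it as a single-bit error.
        if syndrome & (syndrome - 1) == 0:
            # Syndrome is 0 or a power of two: the flip hit a check bit,
            # so the data word itself is intact.
            return data, Status.CORRECTED
        if syndrome in DATA_POSITIONS:
            bit = DATA_POSITIONS.index(syndrome)
            return data ^ (1 << bit), Status.CORRECTED
        return data, Status.UNCORRECTABLE
    # Non-zero syndrome with matching overall parity: two bits flipped.
    return data, Status.UNCORRECTABLE


word = 0xDEADBEEFCAFEF00D
ecc = encode(word)                                # write
clean = decode(word, ecc)                         # read, no error
single = decode(word ^ (1 << 17), ecc)            # read, one data bit flipped
double = decode(word ^ 0b11, ecc)                 # read, two data bits flipped
```

In this sketch a single flipped bit, whether in the data or in the check bits, is repaired transparently, while a double flip is reported as uncorrectable so that the system can log it, in line with the RAS behavior described earlier.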
MetisX's computational memory solution supports an advanced ECC scheme, known as "chipkill", that allows the system to correct single- or double-die failures. In our next article, we will look at this feature in more detail and explain how it works.