

Large-Capacity and High-Speed Instruction Cache Based on Divide-by-2 Memory Banks


Qing-Qing Li | Zhi-Guo Yu | Yi Sun | Jing-He Wei | Xiao-Feng Gu

Abstract—An increase in the cache capacity is usually accompanied by a decrease in access speed. To balance the capacity and performance of caches, this paper proposes an instruction cache (ICache) architecture based on divide-by-2 memory banks (D2MB-ICache). The control circuit and memory banks of D2MB-ICache work at the central processing unit (CPU) frequency and the divide-by-2 CPU frequency, respectively, so that the capacity of D2MB-ICache can be expanded without lowering its frequency. For sequential access, D2MB-ICache can output the required instruction from the memory banks per CPU cycle by dividing the memory banks with a partition mechanism and employing an inversed clock technique. For non-sequential access, D2MB-ICache will fetch certain jump instructions one or two more times, so that it can catch the jump of the request address in time and send the correct instruction to the pipeline. Experimental results show that, compared with conventional ICache, D2MB-ICaches with the same and double capacities show a maximum frequency increase by an average of 14.6% and 6.8%, and a performance improvement by an average of 10.3% and 3.8%, respectively. Moreover, the energy efficiency of 64-kB D2MB-ICache is improved by 24.3%.

Index Terms—Cache capacity expansion, divide-by-2 frequency, instruction cache (ICache), inversed clock.

1. Introduction

In recent years, it has become increasingly difficult for the bandwidth and speed of the main memory to provide the required amount of data to processors, preventing processors from exhibiting the desired performance. To solve this problem, a cache memory is usually included in modern computer systems. An instruction cache (ICache) utilizes the principle of locality to store parts of programs from the main memory, thereby greatly improving the execution speed of those programs. Caches can mitigate the performance discrepancy between processors and the main memory, which has an important impact on the overall performance of processors.

Fig. 1 illustrates the structure of conventional 4-way set-associative ICache [1]. Each memory address consists of the tag, set index, and byte offset. Basically, read operations in conventional ICache include addressing, reading the tag and data banks, judging the hit/miss state, and outputting the selected instruction. When there is a request, the tag and data banks are addressed according to the “set index” and “set index + byte offset”, respectively. Then, the tag value extracted from the memory address is compared with the tag values read from the tag bank. If the tag hits, the corresponding instruction is selected; if the tag misses, the request is forwarded to the lower-level memory. The cache performance is generally determined by the hit ratio, latency, speed, and power consumption [2]-[4]. A large cache capacity mitigates capacity misses and improves the hit ratio. However, an increase in the cache size increases the access time, restricting the development of high-frequency processors [5]. Since more and more advanced processors with high data bandwidth demand a larger and faster cache [4], [6], [7], it becomes necessary to explore a tradeoff between the cache capacity and performance.
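To make the lookup concrete, the following minimal Python sketch decomposes an address and checks the four ways of a set; the geometry (64 sets, 32-byte lines) is a hypothetical example rather than a parameter taken from this paper.

```python
# A minimal sketch (not the paper's RTL) of how a conventional 4-way
# set-associative ICache splits an address and checks for a hit.

SETS, LINE_BYTES, WAYS = 64, 32, 4
OFFSET_BITS = (LINE_BYTES - 1).bit_length()   # log2(32) = 5
INDEX_BITS = (SETS - 1).bit_length()          # log2(64) = 6

def split_address(addr: int):
    """Split a memory address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

def lookup(tag_bank: list[list[int]], addr: int):
    """Compare the request tag against all four ways of the indexed set."""
    tag, index, _ = split_address(addr)
    for way in range(WAYS):
        if tag_bank[index][way] == tag:
            return way        # hit: select this way's instruction
    return None               # miss: forward the request to lower-level memory
```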

On the one hand, researchers have attempted to make full use of the limited cache space and optimize the cache performance, for example, by improving cache replacement algorithms [8]-[10], developing prefetching mechanisms [11], [12], and introducing data compression techniques [13], [14]. The cache replacement algorithms and prefetching mechanisms effectively manage the limited L1 cache and decrease the miss penalty. Nevertheless, the dilemma of the L1 cache capacity still exists. Data compression can increase the effective (logical) capacity of caches, but it introduces decompression latency. Compression schemes are more suitable for the last-level caches (LLCs), which focus on minimizing the miss rate [15]. On the other hand, alternative memory technologies have also been explored [16]-[21], since static random access memory (SRAM) cannot exhibit ideal performance due to its low density and high leakage power in ultradeep submicron processes. Owing to their negligible leakage power and high density, non-volatile memories (NVMs) are expected to be applied to many fields in the future [22]. However, the endurance of NVMs is insufficient, which is the major challenge for NVMs to replace SRAM. As a result, SRAM is still the most widely used memory in the highest-level cache.

This paper proposes a large-capacity and high-speed ICache architecture based on divide-by-2 memory banks (D2MB-ICache). Different from conventional ICache, D2MB-ICache operates at the central processing unit (CPU) frequency while its memory banks operate at the divide-by-2 CPU frequency. This method can expand the D2MB-ICache capacity without lowering its frequency. Compared with conventional ICache, D2MB-ICache takes two CPU cycles per fetch, which halves the speed of instruction fetch, as shown in Fig. 2. We hereby achieve a large-capacity and high-speed D2MB-ICache architecture through the following three contributions.

1) To avoid missing requests when they cross the CPU clock and the memory bank clock, the data bank and the tag bank are divided according to a partition mechanism.

Fig. 1. Conventional 4-way set-associative ICache.

Fig. 2. Read policy in different ICaches: (a) conventional ICache and (b) D2MB-ICache.

2) D2MB-ICache triggers memory banks with inversed clocks and allocates instructions to adjacent memory banks so that the read operations in D2MB-ICache are similar to those in conventional ICache. The required instruction is taken from the memory banks per CPU cycle unless there is a non-sequential request address.

3) A jump replay (JR) module is designed to avoid missing the read request when the access address is non-sequential. The JR module can detect the change of the request address and control the read operations in D2MB-ICache.

The performance evaluation results indicate that, in the 55-nm Semiconductor Manufacturing International Corporation (SMIC) complementary metal-oxide-semiconductor (CMOS) process, D2MB-ICaches with the same (1×) and double (2×) capacities can increase the maximum frequency and reduce the execution time compared with conventional ICache. Quadruple-size D2MB-ICache shows almost no reduction in the maximum frequency and execution time. Moreover, the proposed ICache with a large capacity brings a significant improvement in energy efficiency. The rest of this paper is organized as follows. Section 2 describes the architecture of D2MB-ICache. In Section 3, the write/read operations in D2MB-ICache are presented. Experimental results are described and explained in Section 4. Section 5 discusses related research work. Finally, the conclusion is drawn in Section 6.

2. D2MB-ICache Architecture

As shown in Fig. 3, the tag and data banks in D2MB-ICache are divided into multiple small memory banks that operate at the divide-by-2 frequency. Besides, a JR module and an address change judgment for read operations are added to detect how Req_addr changes and to determine whether D2MB-ICache repeatedly accesses a non-sequential address. If a request to D2MB-ICache occurs, a cache line is selected after addressing. Meanwhile, the JR module compares the lowest bit of “index” (TS) and the lowest log2(N) bits of “index + offset ≫ log2(LBW/8)” (DS), which are described in detail in subsection 2.1, in the current request address with those of the address in S1. Based on the change of the request address, the JR module sends three signals (S1_kill, S1_replay, and S2_replay) to control whether the access operation needs to be repeated. Finally, the proposed ICache either outputs the corresponding instructions or forwards the request to the next-level memory. Table 1 lists the abbreviations used in this work.

Table 1: Abbreviations used in this work

Fig. 3. D2MB-ICache architecture.

2.1. Partition Mechanism of Memory Banks

In conventional ICache, the operating frequency of memory banks is usually the same as that of CPU. Conventional ICache can store SBW-bit data or load LBW-bit data per CPU cycle. However, D2MB-ICache needs to store (2×SBW)-bit data per memory-bank clock cycle. If the number of data memory banks in D2MB-ICache were still SBW/LBW, the written data would be lost. To solve this problem, the tag and data banks are divided into two tag memory banks (T0 and T1) and N data memory banks (D0, D1, ···, DN-1), respectively. The width of the data memory banks is LBW bits. The parameter N is expressed as follows:

N = 2 × SBW / LBW.

Since the data packet transferred between D2MB-ICache and CPU is LBW bits wide, the address offset of the next sequential access is LBW/8. The instructions are stored in the location specified by “index + offset ≫ log2(LBW/8)” of the request address. As shown in Fig. 3, D2MB-ICache utilizes DS to select the corresponding data memory bank. DS from the request address is defined as follows:

DS = (index + offset ≫ log2(LBW/8)) [log2(N) − 1 : 0].

The tag is stored in the location specified by “index”. Therefore, the tag memory banks can be enabled via TS, the lowest bit of “index”, as shown in Fig. 3. TS is defined as follows:

TS = index[0].
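The three quantities above can be summarized in a short Python sketch; SBW = 64 and LBW = 32 follow the example in subsection 3.1, and reading “index + offset” as the concatenation of the set index with the word-granular offset is our interpretation of the notation, not a definition from the paper.

```python
# A sketch, under the definitions above, of the partition parameters.

SBW, LBW = 64, 32                          # store/load bus widths in bits
N = (2 * SBW) // LBW                       # number of data banks: N = 2*SBW/LBW
WORD_SHIFT = (LBW // 8 - 1).bit_length()   # log2(LBW/8): byte -> word offset

def ds(index: int, offset: int, offset_bits: int) -> int:
    """DS: the lowest log2(N) bits of {index, offset >> log2(LBW/8)}."""
    word_offset_bits = offset_bits - WORD_SHIFT
    word_addr = (index << word_offset_bits) | (offset >> WORD_SHIFT)
    return word_addr & (N - 1)

def ts(index: int) -> int:
    """TS: the lowest bit of the set index, selecting T0 or T1."""
    return index & 1
```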

2.2. Inversed Clock Technique

Conventional ICache can receive a new request and output an instruction per CPU cycle. However, in D2MB-ICache, the memory banks operate at the divide-by-2 CPU frequency and take two CPU clock cycles to read data. To make D2MB-ICache send an instruction to the pipeline per CPU cycle, we adopted an inversed clock technique. In this work, a part of the memory banks (D0, D2, ···, DN-2, and T0) operates at clk1, which is divided by two from the CPU clock, clk. The other part of the memory banks operates at an inversed clock, clk2, which is offset by 180° from clk1. In addition, the instructions are distributed to adjacent data memory banks in order. Accordingly, as shown in Fig. 4, adjacent memory banks can sequentially transfer data to the output port of D2MB-ICache per CPU cycle when the request addresses are sequential.

Fig. 4. Inversed clock for D2MB-ICache’s memory banks.
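A small sketch illustrates the resulting access pattern under sequential fetch; the mapping of even-numbered banks to clk1 and odd-numbered banks to clk2 follows the bank grouping above.

```python
# A sketch of the bank/phase pattern under sequential fetch: even-numbered
# banks are triggered by clk1 and odd-numbered banks by clk2, its
# 180-degree inversed clock, so adjacent words come from opposite phases.

def bank_and_phase(word_addr: int, n_banks: int):
    bank = word_addr % n_banks
    return bank, ("clk1" if bank % 2 == 0 else "clk2")

# With N = 4, sequential words alternate D0/clk1, D1/clk2, D2/clk1,
# D3/clk2, D0/clk1, ... -- each bank is touched only every N CPU cycles,
# which a divide-by-2 bank can sustain while the cache still delivers one
# word per CPU cycle.
for w in range(6):
    print(w, bank_and_phase(w, 4))
```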

2.3. JR Module

When non-sequential access occurs, the same divide-by-2 data (or tag) memory bank may be accessed continuously, which results in missed requests. Therefore, we designed a JR module in the proposed ICache. Firstly, the JR module judges the relationship between the memory banks accessed in S0 and S1 by comparing the TS and DS bits in Req_addr_S0 and Req_addr_S1. Then, to avoid missing non-sequential access, the JR module outputs S1_replay or S2_replay and determines how many times a non-sequential request address is repeatedly accessed. When S2_replay is set to 1, D2MB-ICache accesses Req_addr_S2 twice. When S1_replay is set to 1, D2MB-ICache accesses Req_addr_S1 one more time. Besides, when the JR module controls D2MB-ICache to repeatedly access an address, the access request in S1 is incorrect. Therefore, the JR module sends the signal S1_kill to indicate that the access operation in S1 is invalid. These three control signals ensure that each valid non-sequential access can be implemented in the memory banks. The access operations controlled by the JR module are elaborated in subsection 3.2.

3. Access Operations in D2MB-ICache

3.1. Write Operations

In conventional ICache, when a cache miss happens, the write operation is performed every CPU cycle and lasts for BS/SBW times. In D2MB-ICache, a divide-by-2 memory bank takes two CPU cycles to complete a write operation. Therefore, in the two CPU cycles, there are two write requests, and (2×SBW)-bit data should be written into the memory banks. To avoid missing the write requests, the (2×SBW)-bit data are separated into three parts and written into the data memory banks in four CPU cycles. The process of write operations in the data memory banks is illustrated in Fig. 5. The signals data_bank_wmode_0, data_bank_wmode_1, and data_bank_wmode_2 are used to enable the write modes of DP (P = 0, 2, ···, (N/2)−2), DM and D2M+1 (M = 1, 3, ···, (N/2)−1), and D2P+2, respectively. When there is a write request, the SBW-bit input data (Din) from the lower-level memory are split into N/2 LBW-bit words and written into the corresponding data memory banks. In D2MB-ICache, the write operations last for (BS/SBW)+2 CPU cycles. For instance, when SBW = 64 and LBW = 32, N is equal to 4 and each 64-bit Din is split into two 32-bit instructions. On the first rising edge of clk (the first rising edge of clk1), the low 32 bits of Din0 are written into D0. On the second rising edge of clk (the first rising edge of clk2), the high 32 bits of Din0 and the high 32 bits of Din1 are written into D1 and D3, respectively. On the third rising edge of clk (the second rising edge of clk1), the low 32 bits of Din1 are written into D2.

Fig. 5. Timing of write operations in data memory banks.
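The example above can be captured in a brief sketch; the bank assignments follow the Fig. 5 description for SBW = 64, LBW = 32, and N = 4, and the Python dictionaries are only an illustration of which half-word lands in which bank on each rising edge of clk.

```python
# A sketch of the write schedule: two consecutive 64-bit refill words,
# Din0 and Din1, are split into 32-bit halves and spread over D0..D3
# across three rising edges of clk (the three write-mode groups).

def write_schedule(din0: int, din1: int):
    """Return, per rising edge of clk, which 32-bit half goes to which bank."""
    def lo32(d): return d & 0xFFFF_FFFF          # low 32 bits
    def hi32(d): return (d >> 32) & 0xFFFF_FFFF  # high 32 bits
    return [
        {"D0": lo32(din0)},                      # 1st edge (clk1 rises): wmode_0
        {"D1": hi32(din0), "D3": hi32(din1)},    # 2nd edge (clk2 rises): wmode_1
        {"D2": lo32(din1)},                      # 3rd edge (clk1 rises): wmode_2
    ]
```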

Although this mechanism ensures that correct data are written into the data memory banks, it may increase the write latency. Algorithm 1 in Table 2 shows that there are three cases for the read request after the write operation. When DS[0] = 1'h1, the read operation is delayed by one CPU cycle, and the required instruction is taken from DM or D2M+1 (rows 2 to 4 in Algorithm 1). When DS[1:0] = 2'h2, the read operation is delayed by two CPU cycles, and the required instruction is taken from D2P+2 (rows 5 to 7). There is no extra latency when the required instruction is stored in DP (rows 8 to 9). Usually, the first read access to D0 after write operations is performed when sequentially fetched instructions miss, and the first read access to other data banks after write operations occurs only when taken branch instructions miss. Because ICache misses almost always occur during sequential access, the write operations add only a little latency in D2MB-ICache. As for the tag memory banks, the corresponding tag value is written into the same address in BS/SBW CPU clock cycles. Therefore, the write operations in the tag memory banks incur no latency.

Table 2: Algorithm 1
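The three cases of Algorithm 1 amount to a small delay function of the DS bits; the following sketch restates them, with the row references taken from the description above.

```python
# A sketch of the read-after-write cases in Algorithm 1: the extra CPU
# cycles before the first read, as a function of the DS bits of the request.

def read_after_write_delay(ds: int) -> int:
    if ds & 0b1:             # DS[0] = 1'h1: instruction sits in DM or D(2M+1)
        return 1             # read delayed by one CPU cycle (rows 2 to 4)
    if ds & 0b11 == 0b10:    # DS[1:0] = 2'h2: instruction sits in D(2P+2)
        return 2             # read delayed by two CPU cycles (rows 5 to 7)
    return 0                 # instruction sits in DP: no extra latency (rows 8 to 9)
```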

3.2. Read Operations

3.2.1. Access to Data Memory Banks

According to the changes of the request address, the current instruction may be stored in 1) the data memory bank triggered by the inversed clock of the data memory bank accessed in S1, 2) other memory banks with the same clock as the last memory bank accessed, or 3) the same memory bank in which the last instruction is stored. The above three changes are defined as Case1, Case2, and Case3, respectively. The changes of the request address during sequential fetching belong to Case1. The instruction in S1 has to be a jump instruction in Case2 and Case3. Note that the proposed architecture cannot be applied when N = 1, and there is no Case2 when N = 2.
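One possible way to classify these changes from the DS bits of consecutive requests is sketched below; this is our reading of the three cases, not the paper's circuit.

```python
# A sketch of case classification from the DS values of consecutive requests:
# banks with even DS run on clk1 and banks with odd DS on clk2, so the low
# DS bit distinguishes same-phase from opposite-phase banks. (With N = 2,
# any different bank is on the opposite phase, so Case2 never occurs.)

def classify(ds_s1: int, ds_s0: int) -> str:
    if ds_s0 == ds_s1:
        return "Case3"    # same bank as the last access
    if (ds_s0 ^ ds_s1) & 1:
        return "Case1"    # a bank on the opposite (inversed) clock phase
    return "Case2"        # a different bank on the same clock phase
```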

In the pipeline, some instructions may be executed for several cycles, making a request address persist for several cycles. Therefore, a parity counter is included in the JR module to detect the number of consecutive accesses to a request address. Fig. 6 depicts the control flow for the read access to the data memory banks. The parity counter outputs a signal, repeat_data_even, to indicate that Req_addr_S1 has been continuously accessed an even number of times. According to the state of repeat_data_even, two modes are defined: The even mode (repeat_data_even = 1) and the odd mode (repeat_data_even = 0). When there is a read request to D2MB-ICache, the JR module first checks the parity counter. In the odd mode, when the change of Req_addr is Case1, the read operation is normal and the proposed ICache can read the required instruction from the data memory bank every CPU cycle (Operation1, rows 1 to 3 in Definition 1 shown in Table 3); when the change of Req_addr is Case2, the JR module sends S1_replay and S1_kill to D2MB-ICache in the next CPU cycle, which maintains Req_addr_S1 for one more CPU cycle and invalidates the request in S0 (Operation2, rows 4 to 7); when S1 is invalid, the read operation in Case3 is Operation2; when S1 is valid in Case3, the JR module sends S2_replay to D2MB-ICache to fetch the instruction in S2 for two more CPU cycles, and then it invalidates the requests in S0 and S1 (Operation3, rows 8 to 11). In the even mode, the read operation in Case1 is Operation2, and when the change of Req_addr is Case2 or Case3, the read operation is Operation1.

Table 3: Definition 1
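The mode-dependent choice among the three operations can be summarized as a decision function; the sketch below follows Definition 1 as described above, with s1_valid standing for the validity of the request in S1.

```python
# A sketch of the odd/even-mode decision table from Definition 1 (Table 3).

def pick_operation(case: str, repeat_data_even: bool, s1_valid: bool) -> str:
    if not repeat_data_even:                 # odd mode
        if case == "Case1":
            return "Operation1"              # normal read, one word per cycle
        if case == "Case2" or not s1_valid:
            return "Operation2"              # S1_replay + S1_kill
        return "Operation3"                  # Case3 with S1 valid: S2_replay
    # even mode: Case1 needs a replay; Case2 and Case3 read normally
    return "Operation2" if case == "Case1" else "Operation1"
```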

3.2.2. Access to Tag Memory Banks

The changes of Req_addr for the tag memory banks are classified into two cases: The tag memory bank accessed in S0 is different from that accessed in S1 (Case4) or the same (Case5). As shown in Fig. 7, the read operations of the tag memory banks in D2MB-ICache are similar to those of the data memory banks. Compared with the read operation in the data memory banks, the JR module infers from repeat_tag_even in which mode the tag memory banks work. Besides, the JR module deals with Case4 and Case5 in the same way as Case1 and Case3.

Fig. 6. Control flow for read operations in data memory banks.

Fig. 7. Control flow for read operations in tag memory banks.

3.2.3. Timing of Read Operations

In the odd mode, the design of the clocks allows the memory banks to catch the changes of request addresses in time in Case1 and Case4, which adds no extra latency to read operations. Fig. 8 illustrates the timing of the read operations in the odd mode when Case1 occurs. D2MB-ICache can catch every request and read instructions from the data memory banks in every CPU cycle.

The signals S1_replay and S2_replay ensure that D2MB-ICache does not miss any jump of the request address and provides the pipeline with the correct instructions. Fig. 9 shows the timing of read operations in the odd mode when Case2 and Case3 occur.

Fig. 8. Timing of read operations in odd-mode Case1.

Fig. 9. Timing of read operations with S1_replay and S2_replay.

We assume that Req_addr0 and Req_addr1 are both mapped to D0 (Case3), and that Req_addr2 and Req_addr3 are mapped to D1 and D3 (Case2), respectively. On the second rising edge of clk, CPU requests access to Req_addr1. This means that D0 should be accessed on the falling edge of clk1, which may not meet the timing requirements and may cause a wrong output. To solve this problem, the JR module sets S1_kill to 1 and invalidates the second access request. The dotted line with an arrow in Fig. 9 indicates that the current access request is invalid. Besides, the JR module uses S2_replay to control D2MB-ICache to access Req_addr0 and Req_addr1 again on the second and third rising edges of clk1, respectively. On the 7th rising edge of clk, CPU requests access to Req_addr3, which causes D3 to be accessed on the falling edge of clk2. To ensure the correct output of the proposed ICache, the JR module sends S1_kill and S1_replay, and delays the access to Req_addr3. It can be found that S1_replay and S2_replay increase the access latency. Benefiting from the low taken branch instruction ratio [23], the increased execution cycles caused by D2MB-ICache have limited impact on the overall performance of CPU.
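Assuming each replayed access costs one CPU cycle, the added latency of a fetch trace can be estimated with a small sketch; the per-case costs follow the replay behavior described above, and the example trace is hypothetical.

```python
# A sketch, assuming one CPU cycle per replayed access, of the extra latency
# a trace of address changes incurs in the odd mode: S1_replay adds one
# cycle (Case2) and S2_replay adds two (Case3).

REPLAY_COST = {"Case1": 0, "Case2": 1, "Case3": 2}

def extra_cycles(trace: list[str]) -> int:
    """Total added CPU cycles over a trace of address-change cases."""
    return sum(REPLAY_COST[c] for c in trace)

# e.g. two taken branches, one landing on a same-phase bank (Case2) and one
# on the same bank (Case3), add three cycles in total:
assert extra_cycles(["Case1", "Case2", "Case1", "Case3"]) == 3
```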

4. Experiment

4.1. Experimental Framework

The proposed method was implemented on Rocket-Chip [24], a single-issue in-order RISC-V (fifth-generation reduced instruction set computer) processor. To evaluate the performance of D2MB-ICache, the RISC-V processor was simulated with the Synopsys Verilog compiler simulator (VCS), and ten open-source benchmarks from GitHub were executed. The processor and its memory hierarchy configuration are described in Table 4. The executed benchmarks are described in Table 5.

Table 4: Configuration of the simulated processor

Table 5: Ten workloads used for evaluation

4.2. Frequencies and Energy Consumption

The memory banks are generated by the SMIC 55-nm single-port memory compiler. We performed static timing and power analysis of different ICaches in the SMIC 55-nm standard cell library using Synopsys Design Compiler. Fig. 10 shows the comparison of the maximum frequencies between conventional 4-way set-associative ICache and D2MB-ICache with different capacities. With the same size, the proposed D2MB-ICache exhibits higher maximum frequencies than conventional ICache. For instance, 128-kB D2MB-ICache can work at 794 MHz, which is 127 MHz higher than 128-kB conventional ICache. The average maximum operating frequencies of D2MB-ICaches with 1× and 2× capacities increase by 14.6% and 6.8% compared with conventional ICache, respectively. Moreover, the frequency of quadruple-size D2MB-ICache is almost as high as that of conventional ICache without capacity expansion.

Based on Fig. 10, we evaluated the energy consumption of different ICaches operating at their maximum possible frequencies. Fig. 11 presents the total energy consumption normalized to that of 16-kB conventional ICache. The proposed ICaches with capacities of 16 kB, 32 kB, 64 kB, and 128 kB save 0.5%, 16.1%, 24.3%, and 24.8% of the energy consumption versus the conventional ones, respectively. This is because, although the operating frequency of the proposed ICache is higher, the memory banks that constitute D2MB-ICache run at the divide-by-2 frequency. Eventually, the proposed ICache consumes less energy than conventional ICache.

Fig. 10. Maximum operating frequencies in various ICaches.

Fig. 11. Normalized energy consumption in various ICaches.

Fig. 12. Normalized execution cycles of processor applications with conventional ICache and D2MB-ICache.

4.3. Execution Cycles

The variation of execution cycles with different cache sizes and architectures was also analyzed. Processors with three ICache capacities (32 kB, 64 kB, and 128 kB) were simulated using the ten benchmarks presented in Table 5. Due to the existence of branch instructions in the benchmarks, D2MB-ICache needs to fetch certain instructions repeatedly, resulting in more execution cycles of the processor, as illustrated in Fig. 12. Moreover, the number of accesses to conventional ICache and the increased execution cycles caused by D2MB-ICache are counted for each benchmark in Fig. 13. Obviously, the increase in execution cycles is not closely related to the number of non-sequential accesses. D2MB-ICache only repeatedly accesses certain branch target addresses, thereby minimizing the cost of execution cycles caused by address jumps. In short, the increase in execution cycles is mainly affected by the actual changes in the request addresses.

Fig. 13. Number of accesses (Y axis on the left) to conventional ICache and increased execution (EX) cycles (Y axis on the right) caused by D2MB-ICache.

4.4. Execution Time

To evaluate the performance of D2MB-ICache, processors with different ICache configurations executed the ten benchmarks at their maximum frequencies (shown in Fig. 14). With the same cache capacity, the processor with D2MB-ICache takes less execution time than that with conventional ICache, and their execution time difference increases with the capacity. Apart from qsort and dhrystone, the execution time of double-size D2MB-ICache is less than that of conventional ICache without capacity expansion. Compared with conventional ICache, D2MB-ICaches with 1× and 2× capacities exhibit an average execution time reduction of 10.3% and 3.8%, respectively, while the execution time of quadruple-size D2MB-ICache is approximately equal to that of conventional ICache.

Fig. 14. Normalized performance of the processors with different ICaches.

5. Related Work and Comparison

Atoofian [13] exploited the property of value similarity in the L1 data cache and the L2 caches to compress data in caches. Compared with conventional caches, the compressed caches improve performance by 10.1% on average, which is close to the performance improvement of double-sized caches. Due to the increased idle cache banks in the compressed caches, a power gating technique can be applied to reduce the leakage power. As a result, these compressed caches decrease the total energy consumption by 8%. Rea and Atoofian [14] combined data prefetching and compression techniques to expand the logical capacity and mitigate the decompression latency in the compressed caches. They found that the compressed caches with a 4-kB last outcome (LO) and stride (S) prefetcher exhibit a 1.7% speedup over the compression-only caches. However, the performance improvement incurs a penalty in power and area.

Chiu et al. [25] proposed cache resiliency techniques, line recycling (LR) and bit bypass (BB-S), to optimize the cache architecture. Instead of simply disabling the working bits in faulty cache lines, LR reuses them and decreases the capacity loss by 33%. Furthermore, LR saves 43% of the energy consumption with a 0.77% L2 area cost in 28 nm. BB-S uses flip-flops to minimize the overhead of error entries, which provides error protection for the tag arrays.

Spin-transfer torque random access memory (STT-RAM) has been proposed as a promising replacement for SRAM for reducing the leakage power consumption and area overhead. Based on STT-RAM, Li et al. [18] and Kong [19] proposed novel architectures that effectively reduce energy consumption. However, STT-RAM has long write latency, which degrades the performance of the whole processor. Cheng et al. [20] proposed a locality-aware method with a hybrid SRAM and STT-RAM configurable architecture. With only 5% of latency cost, the energy efficiency of the L1 cache is improved by 15% to 20%. In summary, due to performance limitations, there is still a long way to go before NVMs can replace SRAM in the L1 cache.

Different from previous related work, our proposed technique enables the memory banks in ICache to operate at the divide-by-2 CPU frequency, which improves the physical capacity, performance, and energy efficiency of ICache. Compared with double-sized conventional caches, D2MB-ICaches increase performance by 3.8% on average, while the compressed caches in [14] increase it by only 1.7%. Moreover, the power consumption in this work is also better optimized than that of [18] and [19].

6. Conclusion

This paper demonstrates a large-capacity and high-speed ICache architecture consisting of memory banks operating at the divide-by-2 frequency. The proposed D2MB-ICache can provide instructions to the pipeline at the same frequency as CPU by 1) dividing the data and tag memory banks according to a partition mechanism and 2) inverting the memory bank clock and allocating instructions to the adjacent memory banks in order. As for non-sequential access, a JR module was designed to repeat certain branch instructions and ensure that D2MB-ICache does not miss any jump requests. Compared with conventional ICache, D2MB-ICaches with 1× and 2× capacities increase the average maximum frequency by 14.6% and 6.8%, and decrease the average execution time by 10.3% and 3.8%, respectively. The performance of quadruple-size D2MB-ICache is close to that of conventional ICache without capacity expansion. Moreover, D2MB-ICaches with different capacities (16 kB, 32 kB, 64 kB, and 128 kB) reduce the energy consumption by 0.5%, 16.1%, 24.3%, and 24.8%, respectively. The proposed scheme shows a significant performance advantage when programs have a low taken branch instruction ratio. Furthermore, since the growth rate of energy consumption of D2MB-ICache is significantly lower than that of conventional ICache, the proposed architecture is advantageous for large-capacity ICaches.

        Disclosures

        The authors declare no conflicts of interest.
