Memory Latency Measurement Result – Intel Community

Hi Guys:

 

I was playing around with the Intel Memory Latency Checker and later wanted to write my own version of the memory latency measurement program.

 

I know that we usually use pointer chasing for memory latency measurement, but I want to try a simpler method of "flush cacheline –> record time –> mem read addr A –> end record time".

 

I repeat the loop many times. From the results, I found three categories of latency:

  • 80-100 ns, 98% of the results
  • ~150-300 ns, 2% of the results
  • >> 1 us, <0.1% of the results.

80-100 ns seems a reasonable result for memory latency. The >>1 us ones should mostly be caused by interrupts/page misses, etc.
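In code, the three buckets can be expressed roughly as follows (`classify` and `LatencyBins` are names made up for illustration; `cpu_ghz` converts cycle counts to nanoseconds):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical post-processing: bin recorded cycle counts into the three
// latency classes described above.
struct LatencyBins { uint64_t normal = 0, mid = 0, huge_ = 0; };

LatencyBins classify(const std::vector<int32_t>& cycles, double cpu_ghz) {
    LatencyBins b;
    for (int32_t c : cycles) {
        double ns = c / cpu_ghz;
        if (ns > 1000.0)     ++b.huge_;   // >> 1 us: interrupts, page misses, ...
        else if (ns > 150.0) ++b.mid;     // the unexplained 150-300 ns band
        else                 ++b.normal;  // ~80-100 ns: plain DRAM latency
    }
    return b;
}
```

For example, at 2.0 GHz a 200-cycle sample is 100 ns (normal), 400 cycles is 200 ns (mid band), and 4000 cycles is 2 us (outlier).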

 

What bothers me is those in the 150-300 ns range. They seem to happen periodically, weakly aligned to the cacheline size. The latency is too large for the DRAM closed/open page policy difference, too small for the DRAM refresh interval, and also too small for any interrupts.

 

I suspected that the "latency recording" would generate memory writes that interfere with the DRAM latency. However, even when I remove this portion, the "high_latency_ch0" stat still shows ~2% of samples in the 150-300 ns range.

 

Different machines behave slightly differently.

Here is my core function for measuring:


    std::cout << "mem_latency experiment begin" << std::endl;
    // (the flush/timestamp lines inside the loop were garbled by the forum's
    //  code widget; reconstructed here as a plausible version using <x86intrin.h>)
    for (uint64_t i = 0; i < sample_count; i++) {
        _mm_clflush(addr);                          // flush cacheline so the next load goes to DRAM
        _mm_mfence();                               // ensure the flush has completed
        uint64_t start1 = __rdtsc();                // record time
        volatile uint32_t v = *(volatile uint32_t*)addr; (void)v;  // mem read addr A
        unsigned aux; uint64_t end1 = __rdtscp(&aux);  // end record time (ordered after the load)
        int32_t cycle_ch0 = static_cast<int32_t>((end1 - start1) - rdtsc_self_delay);
        sample_array_ch0[i] = cycle_ch0; // Comment out to see if latency recording causes DRAM latency interference
        high_latency_ch0 += (cycle_ch0 > (100 * cpu_ghz));
        // addr = ori_addr + (i*4) % 4096;
    }
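The `rdtsc_self_delay` term is meant to subtract the cost of the timing instructions themselves. The post doesn't show how it is calibrated, so the following is only an assumed sketch: time the two timestamp reads back-to-back with nothing in between, many times, and keep the minimum.

```cpp
#include <algorithm>
#include <cstdint>
#include <x86intrin.h>

// Hypothetical calibration for rdtsc_self_delay: the minimum cost of the
// timestamp instructions with no work between them. Taking the minimum
// discards samples inflated by interrupts or frequency transitions.
uint64_t calibrate_rdtsc_self_delay() {
    uint64_t best = ~0ull;
    unsigned aux;
    for (int i = 0; i < 1000; ++i) {
        uint64_t s = __rdtsc();
        uint64_t e = __rdtscp(&aux);
        best = std::min(best, e - s);
    }
    return best;   // subtract from every measured interval
}
```

If this constant is miscalibrated it shifts every sample uniformly, so it cannot by itself create a separate 150-300 ns band, but it is worth ruling out.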


What’s more interesting is that sometimes you can see some sort of alignment or pattern occurring in the results. [Full csv in attachment]

Tianchen_Jerry_Wang_0-1705765366865.png

 

I’ve tried disabling the Data-Dependent Prefetcher, but it doesn’t seem to be the reason. I also disabled the DCP and L2 Prefetcher in the BIOS, but that doesn’t seem to be related either. [Well, I am not sure if the prefetcher setting in the BIOS is effective....]

 

Here is my CPU spec:


Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      46 bits physical, 48 bits virtual
CPU(s):                             14
On-line CPU(s) list:                0-13
Thread(s) per core:                 1
Core(s) per socket:                 14
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              79
Model name:                         Intel(R) Xeon(R) CPU E5-2660 v4 @ 2.00GHz
Stepping:                           1
CPU MHz:                            1200.178
CPU max MHz:                        3200.0000
CPU min MHz:                        1200.0000
BogoMIPS:                           3999.97
L1d cache:                          448 KiB
L1i cache:                          448 KiB
L2 cache:                           3.5 MiB
L3 cache:                           35 MiB
NUMA node0 CPU(s):                  0-13
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Mitigation; PTE Inversion
Vulnerability Mds:                  Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT disabled
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
                                    syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni
                                    pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
                                    tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti
                                    intel_ppin ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap
                                    intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d


Running out of ideas already.... I was running it on Debian, with only this user-level program running. Is it possible that background kernel threads cause these....?

 

Thanks a lot

Jerry
