{"id":147,"date":"2017-07-18T18:25:58","date_gmt":"2017-07-18T16:25:58","guid":{"rendered":"https:\/\/next-hack.com\/?p=147"},"modified":"2020-06-05T20:17:25","modified_gmt":"2020-06-05T18:17:25","slug":"how-to-play-a-20-fps-video-on-arduino-36-spi-or-usart-in-spi-mode","status":"publish","type":"post","link":"https:\/\/next-hack.com\/index.php\/2017\/07\/18\/how-to-play-a-20-fps-video-on-arduino-36-spi-or-usart-in-spi-mode\/","title":{"rendered":"How to play a 20 fps video on Arduino 3\/6: SPI or USART in SPI mode?"},"content":{"rendered":"<h1 class=\"western\"><span lang=\"en-US\">Introduction<\/span><\/h1>\n<p><span lang=\"en-US\">Hi there!<\/span><\/p>\n<p><span lang=\"en-US\">In the previous episode we showed you how to enable 8MHz transfers on the popular 1.8\u201d Display+SD shield. If you haven\u2019t read it yet, go and <a href=\"https:\/\/next-hack.com\/index.php\/2017\/07\/09\/how-to-play-a-20-fps-video-on-arduino-26-enabling-high-speed-sd-transfers-on-a-1-8-tftsd-module\/\">check it out<\/a>, because it will be required for this episode!!!<\/span><\/p>\n<p><span lang=\"en-US\">In this episode we want to address another issue: is the SPI module of Arduino\u2019s ATMEGA328 fast enough for our devices? <\/span><\/p>\n<p><span lang=\"en-US\">This is a legitimate question. If you measure the time taken by the test program (either with a scope, or with the \u201cSerial Monitor\u201d function of the Arduino IDE), you\u2019ll notice that something is wrong: if the SPI is clocked at 8MHz, you would expect 1MB\/s, so you should read your 900kiB (about 922kB) test.bmp file in 0.9s, whereas it takes about 2.4s, i.e. 
only 375kB\/s.<\/span><\/p>\n<hr \/>\n<div id=\"attachment_133\" style=\"width: 532px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-133\" class=\"wp-image-133\" src=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/A2Fig1.jpg\" alt=\"SPI test monitor\" width=\"522\" height=\"245\" \/><p id=\"caption-attachment-133\" class=\"wp-caption-text\">Fig. 1: Serial output from the SD card test code of the previous episode.<\/p><\/div>\n<hr \/>\n<p><span lang=\"en-US\">Note: in this article we consider M = 1 million and k = 1000. ki denotes 1024 and Mi denotes 1048576, i.e. 1024 ki. For the sake of simplicity, in the video we don\u2019t make this distinction.<\/span><\/p>\n<p><span lang=\"en-US\">This is well below the expected 1MB\/s! Less than 40% of the theoretical throughput!<\/span><\/p>\n<p><span lang=\"en-US\">Of course, there is some software overhead (due to the FATfs library), and when the file pointer crosses a cluster boundary, FATfs needs to read the FAT in order to know the sector at which the next cluster begins. There is also a transmission overhead, because we must send \u2013 through the same SPI lines \u2013 the several bytes that make up the command asking the SD card to read one or more sectors. Finally, when we send a low-level read command to the SD card, it takes some time before the data become available. This time depends on the SD card and on the type of read command: read single block or read multiple blocks. If we read just a single block (as in our case), this delay might be a substantial fraction of the total time taken by the actual read-out. In our case, we verified that the card takes over 225 microseconds! That\u2019s a huge amount of time! These non-SPI related overheads are explained in the video linked at the bottom of the page.<\/span><\/p>\n<p><span lang=\"en-US\">So, to read the 900kiB file, we need to read 1800 sectors of 512 bytes each. 
That means that 225*1800 microseconds are wasted, i.e. more than 0.4 seconds (note: reading in chunks larger than a single sector will certainly reduce the total time we spend waiting). Therefore the time taken to read the actual data is about 2 seconds, i.e. we read at 450kB\/s. <\/span><\/p>\n<p><span lang=\"en-US\">This again includes some software overhead, the time taken to send the SD command and the FAT readout. Still, we are at less than half of the bandwidth one could expect from an 8MHz SPI. Is this due to software overhead only, or is it due to the SPI module being somewhat inefficient? <\/span><\/p>\n<p><span lang=\"en-US\">We will find out that the SPI plays a major role.<\/span><\/p>\n<p><span lang=\"en-US\">In fact, the SPI has single buffered writes, that is, the transmit register (SPDR) is actually the transmit shift register itself, so you can\u2019t write a new byte while the SPI module is still sending the old one.<\/span><\/p>\n<p><span lang=\"en-US\">In order to see if you can send new data, you must check if the SPIF bit of the SPI status register (SPSR) is set to one. 
In other words, you must use the following instruction in C:<\/span><\/p>\n<p><span lang=\"en-US\" style=\"color: #999999;\"><code><span style=\"color: #999999;\">loop_until_bit_is_set(SPSR,SPIF);<\/span><\/code><\/span><\/p>\n<p><span lang=\"en-US\">Which translates into the following assembly instructions:<\/span><\/p>\n<p><span lang=\"en-US\" style=\"color: #999999;\"><code><span style=\"color: #0000ff;\"> Wait_Transmit:<\/span>\u00a0 <span style=\"color: #999999;\">in r16, SPSR;\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/span> <span style=\"color: #000000;\">(Load SPSR to register r16)<\/span><\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <span style=\"color: #999999;\">sbrs r16, SPIF;\u00a0\u00a0\u00a0<\/span>\u00a0\u00a0 <span style=\"color: #000000;\">(SPIF bit set? exit)<\/span><\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>                      \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 <span style=\"color: #999999;\">rjmp Wait_Transmit;\u00a0 <\/span><span style=\"color: #000000;\">(jump to the first instruction)<\/span><\/code><\/span><\/p>\n<p>(Note: We used r16 as an example. The actual register number used will depend on your program. The particular register used will not affect the timings).<\/p>\n<p>The first instruction (<span style=\"color: #999999;\"><code>in r16,SPSR<\/code><\/span>) takes 1 clock cycle. The second one takes 1 clock cycle if the bit was not set, otherwise 2 clock cycles. The third one takes 2 clock cycles.<\/p>\n<p>As you can see, in the best case you must wait 3 clock cycles before you can write the next byte (only the first two instructions are executed), and its transmission will start one cycle later.<\/p>\n<p>Therefore you waste at least 4 CPU clock cycles for each byte you send, and the transmission itself is 16 CPU clock cycles long. That is, in the best case, you need 20 CPU clock cycles to transmit 8 bits. 
In other words, instead of sending at 1MBps you\u2019re sending at only 0.8 MBps. We did not even get close to this in our example.<\/p>\n<p>Still, to save those 4 lost CPU cycles, one might be tempted to write the following trivial code:<\/p>\n<p><code>Write data to SPDR.<br \/>\nWait exactly the minimum number of cycles required for the transfer to complete.<br \/>\nWrite new data to SPDR again.<\/code><\/p>\n<p>If you wanted to transmit data at the full sustained 8Mbps data rate, you should wait 15 CPU cycles between two consecutive writes to SPDR, so that for each byte you have 15 wait cycles + 1 write cycle (OUT instruction). In other words, 16 CPU cycles, that is 1 us, for every 8 bits (i.e. 8Mbps or 1MBps. Please note that 1MBps = 8Mbps, as the B stands for byte and b stands for bit).<\/p>\n<p>You\u2019ll find that 15 wait cycles are not enough: if you only waited 15 cycles, you would incur a write collision, because of the particular implementation of the SPI module, as you can see from the scope in the next figures.<\/p>\n<p>If you write this code and comment out (\/\/) even a single NOP line, not only won\u2019t you gain any speed, but you\u2019ll actually get half the speed (and of course, this will not work, as one byte out of two will be skipped)!<\/p>\n<p><code><span style=\"color: #999999;\">SPSR |= 0x01;\u00a0\u00a0<\/span>\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0 <span style=\"color: #000000;\">\/\/ SPI2X = 1: double the SPI clock<\/span><br \/>\n<span style=\"color: #999999;\">SPCR = 0b01010000;\u00a0\u00a0\u00a0<\/span>\u00a0 <span style=\"color: #000000;\">\/\/ SPE = 1, MSTR = 1: SPI master at 8 MHz<\/span><br \/>\n<span style=\"color: #999999;\">asm volatile<\/span><span style=\"color: #999999;\"><br \/>\n(<\/span><\/code><br \/>\n<code><span style=\"color: #000000;\"> \u00a0 \/\/if you delete even a single nop, the code will not work as expected!<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"transmit%=:\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: 
#999999;\">  \u00a0 \"OUT %[spdr],__tmp_reg__\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">  \u00a0 \"NOP\" \"\\n\\t\"<\/span>\u00a0\u00a0\u00a0\u00a0\u00a0<span style=\"color: #000000;\"> \/\/ sixteenth NOP<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"OUT %[spdr],__tmp_reg__\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span 
style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"NOP\" \"\\n\\t\"<\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">\u00a0 \"RJMP transmit%=\" \"\\n\\t\"\u00a0\u00a0 <span style=\"color: #000000;\">\/\/ only 14 NOP's, because this takes 2 cycles<\/span><\/span><\/code><br \/>\n<code><span style=\"color: #999999;\">: <span style=\"color: #000000;\">\/* no outputs*\/<\/span> : [spdr] \"I\" (_SFR_IO_ADDR(SPDR)) : );<\/span><\/code><\/p>\n<p>Therefore it turns out you must wait for 16 cycles, i.e. 17 CPU cycles per byte, which gives a theoretical maximum throughput achievable with the AVR SPI (when the CPU clock is 16MHz and the SPI clock is 8MHz) of 0.94 MBps.<\/p>\n<p>&nbsp;<\/p>\n<hr \/>\n<div id=\"attachment_136\" style=\"width: 494px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-136\" class=\"wp-image-136\" src=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/A2Fig2.jpg\" alt=\"SPI, 16 NOPs\" width=\"484\" height=\"272\" \/><p id=\"caption-attachment-136\" class=\"wp-caption-text\">Fig. 2. 
Clock waveform measured using 16 NOPs in the code shown above.<\/p><\/div>\n<hr \/>\n<p>&nbsp;<\/p>\n<hr \/>\n<div id=\"attachment_137\" style=\"width: 496px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-137\" class=\"wp-image-137\" src=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/A2Fig3.jpg\" alt=\"SPI test, 15 NOPs\" width=\"486\" height=\"274\" \/><p id=\"caption-attachment-137\" class=\"wp-caption-text\">Fig. 3. Clock waveform measured using 15 NOPs in the code shown above. One byte out of two is missing!<\/p><\/div>\n<hr \/>\n<p>This solution has one drawback with respect to the<span style=\"color: #999999;\"> <code>loop_until_bit_is_set()<\/code> <\/span>function: if an interrupt occurs during a transmission, the CPU will always wait 16 cycles, regardless of whether the SPI has already finished sending the previous data (while the interrupt was being serviced). With <span style=\"color: #999999;\"><code><span style=\"color: #999999;\">loop_until_bit_is_set()<\/span><\/code><\/span> this problem might be partially mitigated, especially, as we\u2019ll see later, when we use the USART instead.<\/p>\n<p>So now, knowing that we need 17 CPU cycles for each byte, we can determine EXACTLY how many cycles are taken by the following piece of code, needed to send one character and wait until the SPI is ready to accept a new byte:<\/p>\n<p><span lang=\"en-US\"><code>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <span style=\"color: #999999;\">out SPDR, r17;\u00a0<\/span>\u00a0\u00a0\u00a0\u00a0\u00a0 (write data to be sent)<\/code><br \/>\n<code><span style=\"color: #0000ff;\">Wait_Transmit:<\/span>\u00a0 <span style=\"color: #999999;\">in r16, SPSR;<\/span>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 (Load SPSR to register r16)<\/code><\/span><br \/>\n<code>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 
<span style=\"color: #999999;\">sbrs r16, SPIF;\u00a0\u00a0<\/span>\u00a0\u00a0\u00a0 (SPIF bit set? exit)<\/code><br \/>\n<code>                      \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 <span style=\"color: #999999;\">rjmp Wait_Transmit;\u00a0<\/span> (jump to the first instruction)<\/code><\/p>\n<p>Note: r17 holds the data to be sent.<\/p>\n<p>We know that the SPIF bit will be high 16 cycles after the OUT instruction. So this is the sequence of instructions actually executed by the CPU:<\/p>\n<table>\n<tbody>\n<tr>\n<td style=\"text-align: center;\" width=\"172\"><strong><span style=\"text-align: center;\">Instruction<\/span><\/strong><\/td>\n<td style=\"text-align: center;\" width=\"166\"><strong><span style=\"text-align: center;\">Current cycle<br \/>\n(before instruction is executed)<\/span><\/strong><\/td>\n<td style=\"text-align: center;\" width=\"151\"><strong><span style=\"text-align: center;\">SPIF in SPSR<br \/>\nwhen instruction is executed<\/span><\/strong><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">out<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">0<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">1<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">in<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">1<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">sbrs<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">2<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">rjmp (cycle 1)<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">3<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: 
center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">rjmp (cycle 2)<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">4<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">in<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">5<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">sbrs<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">6<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">rjmp (cycle 1)<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">7<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">rjmp (cycle 2)<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">8<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">in<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">9<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">sbrs<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">10<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">rjmp (cycle 1)<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">11<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span 
style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">rjmp (cycle 2)<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">12<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">in<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">13<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">sbrs<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">14<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">rjmp (cycle 1)<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">15<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">in<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">16<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">0<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">sbrs<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">17<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">1<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">rjmp (cycle 1)<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">18<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">1<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">rjmp (cycle 2)<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">20<\/span><\/td>\n<td style=\"text-align: center;\" 
width=\"151\"><span style=\"text-align: center;\">1<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">in<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">21<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">1<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">sbrs * (cycle 1)<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">22<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">1<\/span><\/td>\n<\/tr>\n<tr>\n<td width=\"172\">sbrs * (cycle 2)<\/td>\n<td style=\"text-align: center;\" width=\"166\"><span style=\"text-align: center;\">23<\/span><\/td>\n<td style=\"text-align: center;\" width=\"151\"><span style=\"text-align: center;\">1<\/span><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>*Note: In the last step, the sbrs instruction takes 2 cycles, because SPIF is one.<\/p>\n<p>As you can see, it takes 24 cycles (0 through 23) to send 8 bits of data, instead of 16! That is, you\u2019re sending at only 5.3 Mbps, i.e. 0.67 MBps! Even though this value is closer to what we measured, we are still missing the part in which we actually read the received data from the SPDR register (with an \u201cin\u201d instruction) and write this value to our buffer (with an \u201cst\u201d instruction), which accounts for 3 more cycles. This would yield 590kB\/s, still much more than we actually achieved.<\/p>\n<p>And of course, when reading multiple bytes, we must create a loop and check if we have finished reading the required number of bytes, etc.<\/p>\n<p>If you are familiar with the SPI module you can also make clever optimizations, because the SPDR is double buffered in reception. 
For instance, the pseudo code (which works if you need to read at least 2 bytes) might look like:<\/p>\n<p><code>SPDR = 0xFF\u00a0\u00a0\u00a0\u00a0\u00a0 ; after this instruction we have 17 cycles in which we can do something else.<\/code><br \/>\n<code>Set remainingBytes to BytesToRead-1<\/code><br \/>\n<span style=\"color: #0000ff;\"><code>ReceiveLoop:<\/code><\/span><br \/>\n<code>\u00a0 Wait for SPIF bit set in SPSR register<\/code><br \/>\n<code>\u00a0 SPDR = 0xFF<\/code><br \/>\n<code>\u00a0 Read SPDR and put into the destination array <strong><em>(Note that SPDR in read and write mode are actually two different registers!)<\/em><\/strong><\/code><br \/>\n<code>\u00a0 Decrement remainingBytes<\/code><br \/>\n<code>\u00a0 If remainingBytes &gt; 0<\/code><br \/>\n<code>\u00a0 \u00a0 Jump to ReceiveLoop<\/code><br \/>\n<code>\u00a0 Endif<\/code><br \/>\n<code>\u00a0 Wait for SPIF bit set in SPSR register<\/code><br \/>\n<code>\u00a0 Read last byte and put it into the destination array<\/code><\/p>\n<p>In C code we get:<\/p>\n<p><code><span style=\"color: #999999;\">SPDR = 0xFF;<\/span>\u00a0\u00a0\u00a0\u00a0 \/\/ after this instruction we have 17 cycles in which we can do something else.<\/code><br \/>\n<code><span style=\"color: #999999;\">remainingBytes = cnt - 1;<\/span>\u00a0\u00a0\u00a0\u00a0 \/\/ don\u2019t worry: the compiler will optimize and won't allocate a new variable!<\/code><br \/>\n<span style=\"color: #999999;\"><code>do<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>{<\/code><\/span><br \/>\n<code>\u00a0 <span style=\"color: #999999;\">loop_until_bit_is_set(SPSR, SPIF);\u00a0<\/span> \u00a0 \/\/Wait for SPIF bit set in SPSR register<\/code><br \/>\n<code>\u00a0 <span style=\"color: #999999;\">SPDR = 0xFF;<\/span><\/code><br \/>\n<code>\u00a0<span 
style=\"color: #999999;\"> *p++ = SPDR;<\/span>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \/\/Read SPDR and put into the destination array (Note that SPDR in read and write mode are actually two different registers!)<\/code><br \/>\n<code>\u00a0 <span style=\"color: #999999;\">remainingBytes--;<\/span>\u00a0\u00a0 \/\/ Decrement remainingBytes<\/code><br \/>\n<span style=\"color: #999999;\"><code>}<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>while (remainingBytes);<\/code><\/span><br \/>\n<code><span style=\"color: #999999;\">loop_until_bit_is_set(SPSR, SPIF);<\/span>\u00a0 \u00a0 \u00a0\/\/Wait for SPIF bit set in SPSR register<\/code><br \/>\n<code><span style=\"color: #999999;\">*p++ = SPDR;<\/span> \/\/ Read last byte and put it into the destination array<\/code><\/p>\n<p>This code is much more optimized, because instead of doing nothing during the transmission, we do a lot of other work: we read the data, we store it into our array, we decrement the counter, we check if there are any bytes left and we jump to the beginning of the loop.<\/p>\n<p>And the time we get now is 1.86s (495kB\/s). Subtracting the 0.4s overhead due to the card delay, we have 1.46s, i.e. 0.63 MB\/s, which is very close to the 0.67 MB\/s that we calculated before (note that we still have the other software overhead, and the timing previously shown in the table is no longer valid, as we are no longer calling \u201cloop_until_bit_is_set\u201d just after writing to SPDR).<\/p>\n<p>&nbsp;<\/p>\n<hr \/>\n<div id=\"attachment_138\" style=\"width: 509px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-138\" class=\"size-full wp-image-138\" src=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/A2Fig4.png\" alt=\"\" width=\"499\" height=\"379\" \/><p id=\"caption-attachment-138\" class=\"wp-caption-text\">Fig. 
4: Serial output of the optimized SPI code<\/p><\/div>\n<hr \/>\n<p>The scope clearly shows the optimized throughput:<\/p>\n<hr \/>\n<div id=\"attachment_139\" style=\"width: 492px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-139\" class=\" wp-image-139\" src=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/A2Fig5.jpg\" alt=\"\" width=\"482\" height=\"271\" \/><p id=\"caption-attachment-139\" class=\"wp-caption-text\">Fig. 5: Clock waveform with optimized SPI code.<\/p><\/div>\n<hr \/>\n<p>In fact, two bytes are sent in 2.9 microseconds, that is, 690kB\/s. Much better than before, but still not enough. For those who are curious to see the performance of the code of the past episode, here is the measurement with the scope:<\/p>\n<hr \/>\n<div id=\"attachment_140\" style=\"width: 495px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-140\" class=\" wp-image-140\" src=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/A2Fig6.jpg\" alt=\"\" width=\"485\" height=\"273\" \/><p id=\"caption-attachment-140\" class=\"wp-caption-text\">Fig. 6: Clock waveform with the original unoptimized code of the last post.<\/p><\/div>\n<hr \/>\n<p>In Fig. 6, we are sending 2 bytes in about 4 microseconds, i.e. 500kB\/s. In other words, this code alone was responsible for the loss of HALF of the performance!<\/p>\n<h1>The USART in SPI MODE<\/h1>\n<p>Luckily, with the USART in SPI mode we can send data without any gap between consecutive bytes.<\/p>\n<p>In fact, the USART has a double buffered transmission (and read) register: i.e. there is a separate transmit register and transmit buffer (the latter, i.e. the shift register, is not accessible by the CPU). In other words, we can write fresh data even if the USART is still sending the previous one, provided that the transmit register is empty (i.e. 
regardless of whether the transmit buffer is full).<\/p>\n<p>This is very useful, because in this way we can provide the USART module with new data before it is \u201ctoo late\u201d, and continuous transmission (without gaps) can occur. In fact, when the USART transmits the last bit of the transmit buffer, it seamlessly copies the transmit register to the transmit buffer, so it can proceed with the first bit of the new data immediately after the last bit of the previous byte has been output.<\/p>\n<p>In this way we can safely use loop_until_bit_is_set() (to check if the transmit register is empty), being sure that there won\u2019t be a transfer gap. In fact, even if \u201cloop_until_bit_is_set()\u201d might take up to 6 clock cycles (the worst case is when the bit is set just after the first \u201cin\u201d instruction, and this would yield 6 CPU cycles), the USART will still be sending data, and you have 10 more CPU cycles to write to the transmit register before the transmitter goes idle (i.e. 
before a transmit gap would occur).<\/p>\n<p>Therefore, the C code for reading data becomes:<\/p>\n<p><code><span style=\"color: #999999;\">cnt -= 2;<\/span>\u00a0\u00a0 \/\/ next two transmissions are performed out of the do\u2026while loop.<\/code><br \/>\n<span style=\"color: #999999;\"><code>UDR0 = 0xFF;<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>loop_until_bit_is_set(UCSR0A, UDRE0);<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>UDR0 = 0xFF;<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>do<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>{<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>\u00a0 loop_until_bit_is_set(UCSR0A, RXC0);<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>\u00a0 *p++ = UDR0;<\/code><\/span><br \/>\n<code>\u00a0 <span style=\"color: #999999;\">UDR0 = 0xFF;<\/span><\/code><br \/>\n<code>\u00a0<span style=\"color: #999999;\"> cnt--;<\/span><\/code><br \/>\n<span style=\"color: #999999;\"><code>} while (cnt);<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>loop_until_bit_is_set(UCSR0A, RXC0);<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>*p++ = UDR0;<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>loop_until_bit_is_set(UCSR0A, RXC0);<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>UCSR0A |= _BV(TXC0);<\/code><\/span><br \/>\n<span style=\"color: #999999;\"><code>*p++ = UDR0;<\/code><\/span><\/p>\n<p>With this new arrangement, there is no gap between bytes, as you can see from the scope.<\/p>\n<hr \/>\n<div id=\"attachment_141\" style=\"width: 527px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-141\" class=\" wp-image-141\" src=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/A2Fig7.jpg\" alt=\"\" width=\"517\" height=\"291\" \/><p id=\"caption-attachment-141\" class=\"wp-caption-text\">Fig. 
7: Clock output with the USART in SPI mode. No gap is generated between bytes!<\/p><\/div>\n<hr \/>\n<p>Furthermore, measuring the delay between the blue LED turning on (yellow trace) and the red LED turning off (blue trace), we get the total time taken by our test program to read the 900kiB file.<\/p>\n<hr \/>\n<div id=\"attachment_142\" style=\"width: 527px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-142\" class=\" wp-image-142\" src=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/A2Fig8.jpg\" alt=\"\" width=\"517\" height=\"291\" \/><p id=\"caption-attachment-142\" class=\"wp-caption-text\">Fig. 8: Measuring the time taken by reading the 900kiB file.<\/p><\/div>\n<hr \/>\n<p>We get 1.47s, i.e. 627kB\/s. Subtracting the 0.4 s card delay (which, once again, does not depend on the clock speed, but on the card and the number of sectors read at once), we get 0.86 MB\/s. This number can still be improved a little bit if we read in larger chunks, we promise!<\/p>\n<p>Of course, this does not come without drawbacks. In fact, you\u2019ll lose the ability to send data through the UART, which is used by Arduino Uno for the serial USB transmission (in fact now we use a scope to determine the transfer speed) and programming. Therefore you\u2019ll need to disconnect the RXD wire (or the shield you\u2019re using) when you need to reprogram the Arduino.<\/p>\n<p>Another drawback in Arduino Uno is that the ATMEGA328\u2019s USART is connected to the ATMEGA16U2 through 1-kOhm resistors. Therefore some problems might arise (we\u2019ll see this in the next episode) if the ATMEGA16U2 interferes. We will teach you a trick to overcome this issue in the next episode!<\/p>\n<p>Despite all these drawbacks, we have another small advantage! 
If we don\u2019t use the SPI, we do not need to use the so stupidly misplaced IOH connector (how irritating is the fact that they didn\u2019t place it on a 0.1\u201d grid like the other connectors?!?). We can therefore use normal veroboards for our shields! Good!<\/p>\n<p>Finally, here are the steps to replicate our results!<\/p>\n<h1>\u00a0The hack!<\/h1>\n<h3>Prerequisites<\/h3>\n<p>As shown in the previous episode, you need a 900kiB (640&#215;480, 24 bit) bitmap file named test.bmp on your SD card. You also need the hack of the previous episode, and you need the Arduino IDE installed on your system. Of course, you also need an Arduino!<\/p>\n<h3>Step 1:<\/h3>\n<p>Download the <a href=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/sd-test-usart.zip\" target=\"_blank\" rel=\"noopener\">USART test-sd sketch<\/a>.<\/p>\n<h3>Step 2:<\/h3>\n<p>Prepare the circuit shown in the schematic below. The breadboard image shows a possible arrangement. Please note that, again, we used wires to connect the SD+Display module just to show you that there are resistors! In practice (see the photo of the actual implementation) we put the module over the 1.5k resistors! Please also note that, unlike the previous episode, we also have a fourth 1.5k resistor, connected to the MISO line. This is because the ATMEGA16U2 (i.e. the built-in UART-to-USB converter) is powered at 5V and its TXD pin is connected to the RXD pin of the ATMEGA328P through a 1k resistor.<\/p>\n<hr \/>\n<div id=\"attachment_143\" style=\"width: 603px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-143\" class=\" wp-image-143\" src=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/A2Fig9.png\" alt=\"\" width=\"593\" height=\"438\" \/><p id=\"caption-attachment-143\" class=\"wp-caption-text\">Fig. 
9: Schematic of the SD connection to the USART.<\/p><\/div>\n<hr \/>\n<div id=\"attachment_134\" style=\"width: 406px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-134\" class=\" wp-image-134\" src=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/A2Fig10.png\" alt=\"\" width=\"396\" height=\"445\" \/><p id=\"caption-attachment-134\" class=\"wp-caption-text\">Fig. 10: Example of breadboard layout for the connection of the SD to the USART. The image was created using Fritzing (http:\/\/fritzing.org).<\/p><\/div>\n<hr \/>\n<p>This is our actual breadboard implementation.<\/p>\n<hr \/>\n<div id=\"attachment_135\" style=\"width: 513px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-135\" class=\" wp-image-135\" src=\"https:\/\/next-hack.com\/wp-content\/uploads\/2017\/07\/A2Fig11.jpg\" alt=\"\" width=\"503\" height=\"283\" \/><p id=\"caption-attachment-135\" class=\"wp-caption-text\">Fig. 11: Actual breadboard implementation.<\/p><\/div>\n<hr \/>\n<h3>Step 3:<\/h3>\n<p>If you want to measure the time taken by the new program, you have two possibilities. The first is to use a two-channel scope: connect one channel to the output driving the red LED and the other to the output driving the blue LED, then measure the time between the blue LED turning on and the red LED turning off. Otherwise, use a camera\/phone and take a video of the LEDs. Then use a program such as VirtualDub or Avidemux to measure this time (the accuracy depends on your frame rate: at a typical 30fps, one frame lasts 1\/30s = 33ms)!<\/p>\n<p>That\u2019s all for now. Don\u2019t miss the next episode! After all this wait, we will be actually playing the video!!!<\/p>\n<p>Be sure to watch the <a href=\"https:\/\/youtu.be\/Aq1xrUbsRDk\" target=\"_blank\" rel=\"noopener\">video of this episode<\/a> on our YouTube channel! 
Also, rate, comment, share and <a href=\"https:\/\/www.youtube.com\/channel\/UCH6TFYuFH6dt1wj4SCZJ1xA\" target=\"_blank\" rel=\"noopener\">subscribe<\/a>!<\/p>\n<p><iframe loading=\"lazy\" src=\"https:\/\/www.youtube.com\/embed\/Aq1xrUbsRDk\" width=\"560\" height=\"315\" frameborder=\"0\" allowfullscreen=\"allowfullscreen\"><\/iframe><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction Hi there! In the previous episode we showed you how to enable 8MHz transfers on the popular 1.8\u201d Display+SD shield. If you didn\u2019t read it, go and check it, because it will be required for this episode!!! In this&#8230; <a class=\"read-more-button\" href=\"https:\/\/next-hack.com\/index.php\/2017\/07\/18\/how-to-play-a-20-fps-video-on-arduino-36-spi-or-usart-in-spi-mode\/\">(READ MORE)<\/a><\/p>\n","protected":false},"author":1,"featured_media":134,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[29],"tags":[],"class_list":["post-147","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-video-arduino-uno"],"_links":{"self":[{"href":"https:\/\/next-hack.com\/index.php\/wp-json\/wp\/v2\/posts\/147"}],"collection":[{"href":"https:\/\/next-hack.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/next-hack.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/next-hack.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/next-hack.com\/index.php\/wp-json\/wp\/v2\/comments?post=147"}],"version-history":[{"count":48,"href":"https:\/\/next-hack.com\/index.php\/wp-json\/wp\/v2\/posts\/147\/revisions"}],"predecessor-version":[{"id":268,"href":"https:\/\/next-hack.com\/index.php\/wp-jso
n\/wp\/v2\/posts\/147\/revisions\/268"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/next-hack.com\/index.php\/wp-json\/wp\/v2\/media\/134"}],"wp:attachment":[{"href":"https:\/\/next-hack.com\/index.php\/wp-json\/wp\/v2\/media?parent=147"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/next-hack.com\/index.php\/wp-json\/wp\/v2\/categories?post=147"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/next-hack.com\/index.php\/wp-json\/wp\/v2\/tags?post=147"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}