Why IFTTT doesn’t work for emailing files to Google Drive
When I launched my new service Email It In for sending attachments to Google Drive, a couple of friends pointed out that IFTTT could do the same thing. Intrigued, I signed up for IFTTT to find out how well it worked.
Here are the problems I found:
- I signed up with a tagged address (as I do for every web site): helpme+<tag>@gmail.com so I could track more easily any emails sent from IFTTT.
- The way emailing IFTTT works is you have to send FROM your sign up email address to trigger@ifttt.com.
- I can’t send emails from anywhere easily using tags in the From address.
- I can’t give out my ifttt.com email address to family to put files into my Google Drive, because it’s just trigger@ifttt.com.
Furthermore, IFTTT makes no guarantees about how it processes those files. Are they written to disk? How are they processed?
IFTTT is a great service, but just doesn’t compare to emailitin.com in this situation.
Email It In – Email files to Google Drive, Skydrive or Dropbox
This weekend I am finally able to proudly release my little side-project. It’s a Haraka based email service which allows you to create a custom email address to send files to your Google Drive, Skydrive or Dropbox account. You simply use OAuth to connect your account (so we don’t see your password), and we do the rest. Then email your files to the address we give you.
Give it a try, it’s free and I’d love to hear any feedback about it.
SockJS, multiple channels, and why I dumped socket.io
At Hubdoc we have a section of our application which requires real-time interactive feedback – when users add billers we tell them what stage we are at in fetching their documents. After hearing how awesome it was I set out to use socket.io, and indeed it worked great in development on Chrome with a nicely open network. The API was also extremely easy to work with.
But we would get sporadic bug reports of it just hanging, and we had no idea why, no errors in our logs, and no way to find out exactly what was going on.
For a while I tried to evaluate engine.io and the newer 1.0 version of socket.io. Version 1.0 (vs 0.9 we had in production) uses a different method of establishing communication channels – polling first, then upgrading to websockets/flashsockets if it can. I found very little documentation and some confusing reports about hangs and problems getting it to work, so I did further research. It turned out that the Meteor.js project had switched to SockJS for similar reasons, so I decided to bite the bullet and make the switch.
One thing that I found I had to do was figure out a way to establish multiple different communication channels over the same sockjs connection. There is a module called websocket-multiplex which attempts to provide this, but unfortunately it requires you establish the channels on the server ahead of time. This wasn’t an option – I just wanted arbitrary multiplexed channels established whenever I wanted them.
The solution was to communicate using a uuid with each message, and for the server to keep track of those uuids for the lifetime of the connection using a simple closure.
The next problem was to provide an EventEmitter type interface for these channels. Creating an EventEmitter in Node is really easy, so the code for that was just:
var util = require('util');
var EventEmitter = require('events').EventEmitter;
function SockJSEmitter (conn, uuid) {
    EventEmitter.call(this);
    this.conn = conn;
    this.uuid = uuid;
}
util.inherits(SockJSEmitter, EventEmitter);
// "emit" an event over the SockJS connection
SockJSEmitter.prototype.emit = function (event, data) {
    if (event === 'newListener') return;
    this.conn.write(JSON.stringify({event: event, uuid: this.uuid, data: data}));
};
// call this when we receive an event from the remote end
SockJSEmitter.prototype.emit_event = function (event, data) {
    EventEmitter.prototype.emit.call(this, event, data);
};
With that module, I can route data to the right SockJSEmitter with the following code:
sockjs_server.on('connection', function (conn) {
    var sockjs_uuid_map = {};
    conn.on('data', function (message) {
        var msg = JSON.parse(message);
        if (!(msg.event && msg.uuid)) throw "Invalid message format: " + message;
        var sockjs_emitter;
        // StartRobot is the first message we get from the browser
        // -- we use it to setup the SockJSEmitter and associate with a uuid
        if (msg.event === 'StartRobot') {
            sockjs_emitter = new SockJSEmitter(conn, msg.uuid);
            var status = run_robot(msg.data, sockjs_emitter);
            if (status.status === "Started") {
                sockjs_uuid_map[msg.uuid] = sockjs_emitter;
            }
            sockjs_emitter.emit("start_status", status);
        }
        else {
            // For every other message we route to the right
            // sockjs_emitter and emit_event on it
            sockjs_emitter = sockjs_uuid_map[msg.uuid];
            // ignore messages for a uuid that never started
            if (!sockjs_emitter) return;
            sockjs_emitter.emit_event(msg.event, msg.data);
        }
    });
});
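To see the uuid routing idea in isolation, here is a minimal sketch that runs without SockJS at all. It assumes a made-up "open" event in place of StartRobot, and makeRouter and the "progress" event are names I invented for illustration. The channel map lives in a closure, one per connection, just like sockjs_uuid_map above.

```javascript
var EventEmitter = require('events').EventEmitter;

// Returns a handler for the raw messages of one connection. The channels
// map lives in this closure, so it goes away when the connection does.
function makeRouter(onOpen) {
    var channels = {};
    return function route(raw) {
        var msg = JSON.parse(raw);
        if (msg.event === 'open') {
            channels[msg.uuid] = new EventEmitter();
            onOpen(msg.uuid, channels[msg.uuid]);
            return true;
        }
        var channel = channels[msg.uuid];
        if (!channel) return false;   // unknown uuid: drop rather than crash
        channel.emit(msg.event, msg.data);
        return true;
    };
}

var seen = [];
var route = makeRouter(function (uuid, channel) {
    channel.on('progress', function (data) {
        seen.push(uuid + ':' + data);
    });
});

route(JSON.stringify({event: 'open', uuid: 'a'}));
route(JSON.stringify({event: 'open', uuid: 'b'}));
route(JSON.stringify({event: 'progress', uuid: 'b', data: 50}));
route(JSON.stringify({event: 'progress', uuid: 'a', data: 10}));
var routed = route(JSON.stringify({event: 'progress', uuid: 'zzz', data: 1}));

console.log(seen);    // [ 'b:50', 'a:10' ]
console.log(routed);  // false
```

Each uuid only ever sees its own events, even though everything arrives through the same handler.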
Now we need to implement the flip side for the client – always sending JSON containing at least {event: e, uuid: u} when sending messages back and forth. This looks a little more complex because it has to wait for the connection to be established, and it maintains a queue of clients to be started once the SockJS connection is open. It just uses a status variable with three states: “disconnected”, “connecting” and “connected”.
var uuid_map = {};
var sockjs = null;
var sockjs_status = 'disconnected';
var pending = [];
Robot.prototype.init = function () {
    var self = this;
    uuid_map[self.uuid] = self;
    if (sockjs_status === 'disconnected') {
        sockjs_status = 'connecting';
        sockjs = new SockJS(app.robotServer + "/_sockjs");
        sockjs.onmessage = function (e) {
            if (e.type != "message") return;
            var msg = JSON.parse(e.data);
            // look up the Robot this message belongs to
            var robot = uuid_map[msg.uuid];
            if (!robot) throw "No such uuid";
            var method = robot['event_' + msg.event];
            if (!method) throw "No such event: " + msg.event;
            method.call(robot, msg.data);
        };
        sockjs.onopen = function () {
            sockjs_status = 'connected';
            for (var i=0; i<pending.length; i++) {
                pending[i].run();
            }
            pending = [];
        };
        sockjs.onclose = function () {
            sockjs_status = 'disconnected';
        };
        pending.push(self);
    }
    else if (sockjs_status === 'connecting') {
        pending.push(self);
    }
    else if (sockjs_status === 'connected') {
        self.run();
    }
};
Robot.prototype.run = function () {
    this.sockjs_send("StartRobot", this.robot_data);
};
Robot.prototype.sockjs_send = function (event, data) {
    sockjs.send(JSON.stringify({event: event, uuid: this.uuid, data: data}));
};
// Now implement Robot.prototype.event_* = function (data)
// -- these will catch your remote events specific to this instance of Robot.
So far I’m happy with the transition – XHR polling is working well from IE, WebSockets are working from browsers that support them, and hopefully we’ll now get fewer bug reports about mysterious things just not working. Will keep you posted.
ANN: Haraka v2.1.0
Upgrading from 2.0.x is as simple as “npm install -g Haraka”.
Full list of changes:
- Fix restart bug which caused outbound queue to be loaded multiple times
- Allow get_mx hook to also set outbound IP address to bind to, allowing switching of outbound IP address
- Support for listening on multiple IPs/Ports in smtp.ini (requires the new “listen=” directive)
- Allow continuation lines in .ini files
- Fix require() in plugins to work more like people expect (allow require(‘./lib/foo’) in plugins dir for example)
- Require authentication if port is 587
- Fix qmail-queue to work on node 0.8+
- Allow for SMTP AUTH in smtp_forward and smtp_proxy plugins
- Added message_stream.get_data() method to get the mail as a string
As usual, please report any issues on the Haraka bug tracker at https://github.com/baudehlo/Haraka/issues
Render PDFs on the Server with PDF.JS and Node-Canvas
It occurred to me today that the new PDF.js library just makes calls into a canvas element on the document to render its content. So why couldn’t we change that to render directly to a node-canvas version of the canvas?
Turns out you can, and it’s really easy.
I made minor changes to pdf.js to get rid of the assumption of the browser, and added in a few Node.js specific requirements. In all the diff is less than a page.
Then I could just call pdf.js as though we had a local canvas. Here’s the code that works for me:
"use strict";
var Canvas = require('canvas');
var PDFJS = require('./pdf.js');
var fs = require('fs');
PDFJS.disableWorker = true;
fs.readFile(process.argv[2], function (err, data) {
    if (err) throw err;
    var data_array = new Uint8Array(data);
    PDFJS.getDocument(data_array).then(function (pdf) {
        pdf.getPage(1).then(function (page) {
            var scale = 1.5;
            var viewport = page.getViewport(scale);
            var canvas = new Canvas(viewport.width, viewport.height);
            var ctx = canvas.getContext('2d');
            page.render({canvasContext: ctx, viewport: viewport}).then(function () {
                console.log("Finished rendering?");
                var png = fs.createWriteStream('/tmp/test.png');
                canvas.pngStream().pipe(png);
            }, function (err) {
                console.log("Got error: " + err.stack);
            });
        });
    });
});
Performance isn’t terrible either – but I’d have to do more benchmarking to compare it to something like mupdf or xpdf for generating images from pages of PDFs.
The big question is just on appearance – on the Mac I used it messes up the fonts a fair bit. This may be just issues with Cairo on the Mac, or it may be a more fundamental problem. I get much better results with mupdf at this time.
How an Event Loop works
Getting into coding in Node.js came very naturally to me. Prior to this I had written a fair bit of code using Danga::Socket in Perl, and various solutions on top of it. Danga::Socket is an Event Loop for Perl, so I already understood the internals and how everything hung together in an Event Loop. But it’s not that way for a large number of people coming to Node from other languages.
So let’s start with the basics of how an event loop works.
I’m going to start with simple timers – in Node/Javascript this is the concept of setTimeout() and setInterval(). These simply run code at a later time, a number of milliseconds in the future.
If the current time in milliseconds since the epoch is 1000 (yes I know that’s nowhere near what it really is), and we ask for a setTimeout() to run in 25ms time, that means the fire time for that event is 1025. We can loop over and over, checking whether fire_time <= current_time, and fire the events whose time has passed.
So the basics of our event loop become the following code:
while (1) {
    // run all setTimeout() code whose fire time has passed
    _run_timers();
}
This would loop forever, running timers that are due to fire. However, a weakness is that it is what we call a “busy loop” – it burns 100% CPU even when there are no timers to fire. To improve this, we make _run_timers() return the number of milliseconds until the next timer is due to fire:
while (1) {
    var next_timeout = _run_timers();
    ms_sleep(next_timeout);
}
Consider here that ms_sleep() just puts the process to sleep until the next timeout is due to fire (in C we would use “usleep()”).
Two important things become obvious out of this:
- Timers aren’t time-accurate – they may run slightly late (though never early).
- Blocking the event loop is bad because other timers don’t get to run
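Both points can be shown with a small simulation. This is only a sketch: it uses a virtual clock instead of real sleeping so the result is deterministic, and addTimer, runLoop and busy_ms are names made up for the example.

```javascript
// Virtual clock so the example is deterministic; a real loop would use Date.now().
var clock = 1000;
var queue = [];      // pending timers, kept sorted by fire_time
var lateness = {};   // how many ms after its fire_time each timer actually ran

function addTimer(ms, name, busy_ms) {
    queue.push({ name: name, fire_time: clock + ms, busy_ms: busy_ms || 0 });
    queue.sort(function (a, b) { return a.fire_time - b.fire_time; });
}

function runLoop() {
    while (queue.length > 0) {
        var next = queue.shift();
        // "sleep" until the timer is due; we never wake up early
        if (next.fire_time > clock) clock = next.fire_time;
        lateness[next.name] = clock - next.fire_time;
        // the callback hogs the loop for busy_ms before returning
        clock += next.busy_ms;
    }
}

addTimer(10, 'slow', 50);   // due at 1010, then blocks the loop for 50ms
addTimer(20, 'fast', 0);    // due at 1020
runLoop();

console.log(lateness);      // { slow: 0, fast: 40 }
```

The "fast" timer did nothing wrong, yet it runs 40ms late because the "slow" callback was still holding the loop when it came due.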
Now we need a data structure that can keep our timers. For this we can use an array of Timer objects:
var Timers = []; // list of current Timers
// our Timer object constructor
function Timer (fire_time, cb) {
    this.fire_time = fire_time; // in epoch-ms
    this.callback = cb;
}
We are assuming that the Timers array is in the order that events will fire. Now we can implement _run_timers() as follows:
function _run_timers () {
    var now = Date.now();
    while (Timers.length > 0 && Timers[0].fire_time <= now) {
        var to_run = Timers.shift();
        if (to_run.callback) to_run.callback();
    }
    if (Timers.length === 0) return -1;
    return (Timers[0].fire_time - now);
}
If we assume that the current time (now) is 1000, and the Timers array contains entries with fire_time values of [980, 985, 995, 1005, 1010], the end result is that we run the first three callbacks, Timers is left containing [1005, 1010], and we return 5 from our function. We sleep for 5 milliseconds and call _run_timers() again.
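That walk-through can be checked with a few lines of code. This sketch uses the same logic as _run_timers(), except that the timer list and the current time are passed in as arguments (my change, so the example doesn't depend on the real clock).

```javascript
function Timer(fire_time, cb) {
    this.fire_time = fire_time;
    this.callback = cb;
}

// Same logic as _run_timers(), with `timers` and `now` passed in explicitly.
function runTimers(timers, now) {
    while (timers.length > 0 && timers[0].fire_time <= now) {
        var to_run = timers.shift();
        if (to_run.callback) to_run.callback();
    }
    if (timers.length === 0) return -1;
    return timers[0].fire_time - now;
}

var fired = [];
var timers = [980, 985, 995, 1005, 1010].map(function (t) {
    return new Timer(t, function () { fired.push(t); });
});

var next_timeout = runTimers(timers, 1000);

console.log(fired);          // [ 980, 985, 995 ]
console.log(next_timeout);   // 5
console.log(timers.length);  // 2
```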
Now all we have to do is implement setTimeout() so it keeps that array in order. Here’s a copy of the implementation in Danga::Socket converted to Javascript:
function setTimeout (cb, ms) {
    var fire_time = ms + Date.now();
    var timer = new Timer (fire_time, cb);
    // Optimise for the case where a timer fires after
    // all current timers
    if (Timers.length === 0 || fire_time >= Timers[Timers.length - 1].fire_time) {
        Timers.push(timer);
        return timer;
    }
    // Otherwise find where we insert linearly:
    for (var i = 0; i < Timers.length; i++) {
        if (Timers[i].fire_time > fire_time) {
            Timers.splice(i, 0, timer);
            return timer;
        }
    }
    throw "Should never get here";
}
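If the linear scan bothers you, one possible variation (not what Danga::Socket does, as far as the code above shows) is to find the insertion point with a binary search. A sketch, where findInsertIndex and insertTimer are names made up for the example:

```javascript
// Find the first index whose fire_time is strictly greater than fire_time,
// so timers with equal fire times keep their insertion order.
function findInsertIndex(timers, fire_time) {
    var lo = 0, hi = timers.length;
    while (lo < hi) {
        var mid = (lo + hi) >>> 1;
        if (timers[mid].fire_time > fire_time) hi = mid;
        else lo = mid + 1;
    }
    return lo;
}

function insertTimer(timers, timer) {
    timers.splice(findInsertIndex(timers, timer.fire_time), 0, timer);
    return timer;
}

var timers = [];
[1010, 980, 1005, 995, 985, 995].forEach(function (t, i) {
    insertTimer(timers, { fire_time: t, id: i });
});

var order = timers.map(function (t) { return t.fire_time; });
console.log(order);  // [ 980, 985, 995, 995, 1005, 1010 ]
```

The search takes O(log n) comparisons, but splice() still has to shift elements, so the insert as a whole remains linear in the worst case.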
So now we have everything that we need to implement an event loop in Javascript, at least only implementing timers. Implementing setInterval() is as simple as:
function setInterval (cb, ms) {
    var _f = function () {
        cb();
        setTimeout(_f, ms);
    };
    setTimeout(_f, ms);
}
And we can cancel timers by clearing their .callback entry (they will remain in the array, but that’s not a big deal for this example).
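A minimal sketch of that cancellation, assuming the same Timer shape as above. clearTimer and runDue are hypothetical names, and the current time is passed in rather than read from the clock.

```javascript
var Timers = [];   // kept in fire_time order
var fired = [];

function Timer(fire_time, cb) {
    this.fire_time = fire_time;
    this.callback = cb;
}

// Hypothetical clearTimeout(): leave the entry in the array, drop its callback.
function clearTimer(timer) {
    timer.callback = null;
}

// Run everything due at `now`; cancelled timers are shifted off and ignored.
function runDue(now) {
    while (Timers.length > 0 && Timers[0].fire_time <= now) {
        var to_run = Timers.shift();
        if (to_run.callback) to_run.callback();
    }
}

var a = new Timer(1000, function () { fired.push('a'); });
var b = new Timer(1005, function () { fired.push('b'); });
Timers.push(a, b);

clearTimer(a);
runDue(1010);

console.log(fired);          // [ 'b' ]
console.log(Timers.length);  // 0
```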
…
Let’s expand on this to deal with how Node provides us with events on sockets and files. We can mostly assume that aside from Timers we are only dealing with File Descriptors (FDs). This is a misnomer, so don’t focus on the name – they are part of the unix concept of “Everything is a file” and abstract away many different systems in a modern Unix, not just files. Let’s now add to our global objects a mapping of FDs to “things” which can handle events on our file descriptors:
var FDMap = {};
Every time Node opens a file, or creates a server, or gets a connection on a server, all these things create new FDs, and must be added to that mapping table:
function _seen_new_fd (fd) {
    FDMap[fd] = new FDHandler(fd);
}
So now what? How do we know when there are these “events” on file descriptors? Well luckily the Kernel (be it Linux, BSD/OSX, Solaris or Windows) has ways of letting us know something happened. Let’s focus on one of those methods as Node hides all the gritty details for us anyway. In Linux we use something called “epoll”. epoll lets us set up a structure which monitors those file descriptors and notifies us of events on them. The key events are “read” and “write” but there are others too.
Now we can change our original event loop to look like this:
for (var fd in FDMap) {
    add_to_epoll_set(fd);
}
while (1) {
    var timeout = _run_timers(); // remember this from above?
    var events = epoll_wait(Epoll, 1000, timeout);
    for (var i = 0; i < events.length; i++) {
        var event = events[i];
        // event is an array of [fd, state]
        var fd_handler = FDMap[event[0]];
        var state = event[1];
        if (state & EPOLLIN) fd_handler.do_read();
        if (state & EPOLLOUT) fd_handler.do_write();
    }
}
I will skip over further implementation details, but these are the basics of an epoll-based event loop, with a LOT of error checking removed for clarity.
- First we run the timers and get the timeout
- Then we call epoll_wait() passing in the timeout returned from _run_timers
- Then we process the events returned, which map to file descriptors
- Then we loop back and do it all again
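You can't call epoll from plain Javascript, but the dispatch part of those steps can be simulated. In this sketch fake_epoll_wait() is a made-up stand-in that hands back scripted batches of [fd, state] events instead of asking the kernel, and the timer step is left out to keep it short. The EPOLLIN and EPOLLOUT values are the ones Linux uses.

```javascript
var EPOLLIN = 0x001, EPOLLOUT = 0x004;
var log = [];

function FDHandler(fd) { this.fd = fd; }
FDHandler.prototype.do_read = function () { log.push('read ' + this.fd); };
FDHandler.prototype.do_write = function () { log.push('write ' + this.fd); };

var FDMap = { 3: new FDHandler(3), 4: new FDHandler(4) };

// Hypothetical stand-in for epoll_wait(): one scripted batch per call.
var scripted = [
    [[3, EPOLLIN]],
    [],                                        // timeout expired, nothing ready
    [[4, EPOLLIN | EPOLLOUT], [3, EPOLLOUT]]
];
function fake_epoll_wait() { return scripted.shift(); }

// A real loop runs forever; this one stops when the script runs out.
while (scripted.length > 0) {
    var events = fake_epoll_wait();
    for (var i = 0; i < events.length; i++) {
        var fd_handler = FDMap[events[i][0]];
        var state = events[i][1];
        if (state & EPOLLIN) fd_handler.do_read();
        if (state & EPOLLOUT) fd_handler.do_write();
    }
}

console.log(log);  // [ 'read 3', 'read 4', 'write 4', 'write 3' ]
```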
…
OK so how does this all relate to your HTTP server written in Express? We’re a million miles abstracted from that.
At the low level your HTTP server is just a socket that is listening on a port (usually 80 for http, or 443 for https). In Unix terms a socket is represented by an FD, so is part of the descriptor map (FDMap) above.
When a connection comes in from the internet, the kernel reports, via epoll_wait, that a “read” event fired on that FD.
When we get that “read” event, node accepts the connection and passes it to your Javascript program. But it does more than that too – an accepted connection gives you another FD – one that represents that particular connection. And as we said earlier – every FD that node sees gets added to the FDMap, and to the event loop (epoll). Node keeps track of all these FDs for you.
So when we get data sent to us over that connection, it fires the “read” event on it. In HTTP terms this means that we probably have something like:
GET /path/to/resource HTTP/1.1
Host: app.hubdoc.com
Connection: close
Now most users of Node will be using some sort of HTTP layer, which will parse this data automatically. If you were using Node’s “net” module you would have to do it yourself.
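To give a feel for what "doing it yourself" involves, here is a deliberately naive sketch of parsing the request above. parseRequest is a made-up helper. It assumes the whole request head arrived in one read and ignores bodies, malformed input and everything else a real HTTP parser has to handle.

```javascript
function parseRequest(raw) {
    // Everything before the first blank line is the request head.
    var head = raw.split('\r\n\r\n')[0];
    var lines = head.split('\r\n');
    var request_line = lines.shift().split(' ');
    var headers = {};
    lines.forEach(function (line) {
        var idx = line.indexOf(':');
        if (idx === -1) return;   // skip anything that isn't "Name: value"
        var name = line.slice(0, idx).trim().toLowerCase();
        headers[name] = line.slice(idx + 1).trim();
    });
    return {
        method: request_line[0],
        path: request_line[1],
        version: request_line[2],
        headers: headers
    };
}

var raw = 'GET /path/to/resource HTTP/1.1\r\n' +
          'Host: app.hubdoc.com\r\n' +
          'Connection: close\r\n' +
          '\r\n';
var req = parseRequest(raw);

console.log(req.method, req.path);    // GET /path/to/resource
console.log(req.headers.host);        // app.hubdoc.com
console.log(req.headers.connection);  // close
```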
After parsing the HTTP commands, Express then translates that into a Route using a lookup table, and the rest is up to your application.
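Conceptually that lookup is no more than the following sketch. The routes table and dispatch() are invented for illustration and do exact matching only. Express's real router also supports path patterns and middleware, which this ignores.

```javascript
// Hypothetical route table keyed by "METHOD path".
var routes = {
    'GET /path/to/resource': function (req) {
        return '200 resource for ' + req.headers.host;
    }
};

function dispatch(req) {
    var handler = routes[req.method + ' ' + req.path];
    if (typeof handler !== 'function') return '404 Not Found';
    return handler(req);
}

var found = dispatch({
    method: 'GET',
    path: '/path/to/resource',
    headers: { host: 'app.hubdoc.com' }
});
var missing = dispatch({ method: 'GET', path: '/nope', headers: {} });

console.log(found);    // 200 resource for app.hubdoc.com
console.log(missing);  // 404 Not Found
```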
…
What about writing?
Writing data in this system is a little more confusing, because the “write” event isn’t really telling you that you SHOULD write, just that there’s space left in the kernel buffers so that you CAN write. More importantly, your application never cares about “write” events because Node handles it all internally. Once you understand that it gets a bit simpler. There are multiple levels of buffering occurring – Node has to buffer data that your application has tried to write that didn’t get to the kernel, and send it when the epoll system tells it that it can.
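Here is a sketch of that buffering idea. FakeKernel and BufferedWriter are made up for the example and are not Node internals: the fake kernel accepts a limited number of characters until it is drained, much like a non-blocking write() returning a short count.

```javascript
// Hypothetical stand-in for a kernel socket buffer with limited space.
function FakeKernel(capacity) {
    this.capacity = capacity;
    this.used = 0;
    this.received = '';
}
// Accept as much as fits and return how many characters were taken.
FakeKernel.prototype.write = function (str) {
    var n = Math.min(str.length, this.capacity - this.used);
    this.received += str.slice(0, n);
    this.used += n;
    return n;
};
// The other end read the data, so there is space again.
FakeKernel.prototype.drain = function () { this.used = 0; };

function BufferedWriter(kernel) {
    this.kernel = kernel;
    this.pending = '';   // data the application wrote that the kernel hasn't taken
}
// What the application calls: never blocks, queues whatever doesn't fit.
BufferedWriter.prototype.write = function (str) {
    this.pending += str;
    this.do_write();
};
// What the event loop calls when it sees a "write" (EPOLLOUT) event.
BufferedWriter.prototype.do_write = function () {
    var n = this.kernel.write(this.pending);
    this.pending = this.pending.slice(n);
};

var kernel = new FakeKernel(5);
var writer = new BufferedWriter(kernel);

writer.write('hello world');
var after_first = writer.pending;
console.log(JSON.stringify(after_first));      // " world"

// Each drain stands in for the kernel telling us we CAN write again.
kernel.drain(); writer.do_write();
kernel.drain(); writer.do_write();
console.log(JSON.stringify(kernel.received));  // "hello world"
console.log(JSON.stringify(writer.pending));   // ""
```

The application called write() once and moved on. Everything after that happened in response to "write" events.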
…
In summary, this was a VERY high level overview of how event loops work, skipping over a LOT of the details of error checking, edge cases, and the reality of how things work low down. But it should give the average Node programmer a better idea of how things work, why blocking the event loop is bad, and how setTimeout works (and thus why setTimeout isn’t “accurate” time-wise).
For more detail I recommend reading the source code to Danga::Socket: http://cpansearch.perl.org/src/BRADFITZ/Danga-Socket-1.61/lib/Danga/Socket.pm – it’s simple to read even for Perl, and the key functions are at the top of the file. It performs well and scales to hundreds of thousands of concurrent connections – I should know – I’ve pushed it that far personally.
Feel free to leave me questions or comments. I’d love to know if you found this useful.
Spam Your Users
People who know my past projects will find it hard to believe I’m suggesting that. But of course I’m not talking about true spam. I’m talking about users who have signed up for your product or project.
It’s very easy for your users to forget about you. People sign up for all kinds of things to try them out. Maintaining their interest in the long run is hard. Especially for a project like Hubdoc, where we are making people’s lives around bills and statements easier but also making it so easy that they rarely need to log in.
One thing we’ve discovered at Hubdoc is that when we invite friends and family to try out the service, they will often sign up for an invite, and then when we send the invite they completely forget about going in and trying the service out. Maybe they said they would try later and forgot, or they clicked but decided not to do it just then, or perhaps they missed the email.
Thankfully we’re using PostgreSQL to store all our data, so finding users who hadn’t logged in, who we had sent an invite to, but who hadn’t been sent that invite in the last couple of days, was a trivial query. Sending them each a mail giving them their unique signup URL, and gently reminding them to log in, is a good way to get them to activate their accounts. As long as we don’t do this too often (and give them an opt out) they won’t mind this kind of reminder email.
This extends further as well, to users who just tried the system once, and didn’t get very far – email them to say “we notice you didn’t get very far in the system. Do you need help? Click here to log in”. Do this every once in a while and you will get them re-engaged.
Obviously there’s more to this, and there are lots of different reasons to email your users, such as when you add new features or functionality, taking particular care to email the users who requested them.
Email is an incredibly powerful tool for getting in touch with your users. Use it.