Questions on DTX and KV Puts values visibility


ping.wong@...
 

Hi All,
 
In a two-node cluster (one master/leader and one replica), I wrote a client application with multiple writer and reader threads operating on the same object with different keys and values:
1. Writers issue daos_kv_put(oh, DAOS_TX_NONE, ..., &ev); call daos_event_test(&ev, DAOS_EQ_WAIT, &ev_flag);                // wait for IO completion
2. Readers issue daos_kv_get(oh, DAOS_TX_NONE, 0, key, &size, buf, &ev); call daos_event_test(&ev, DAOS_EQ_WAIT, &ev_flag); // wait for IO completion
 
I observed that daos_kv_get occasionally returns an older version.  I have a few questions regarding DTX and the visibility of values from the client's perspective.
 
1. On the master/leader server, when it sends the RPC reply to the client for daos_kv_put():
     1.1 Has the put been committed before returning to the client?  If not, what is the typical asynchronous commit threshold that makes the put values visible to other clients?
     1.2 Under what condition(s) does the leader commit the transaction?
     1.3 Does the leader send an RPC to the replica to commit the transaction synchronously, or asynchronously after replying to the client?
   
2. On the master server, how long does it wait (timeout) for the replica to reply?
    2.1 Does the master server send an RPC to the replica if the leader does not get a reply from the replica (e.g. when the replica dies)?
    2.2 Under what condition(s) does the master server commit synchronously?

The following is a partial output from the client application:
 
ReadThreadFunc thread 2 key=keyabcde00000006 buf=Value00000001
ReadThreadFunc thread 0 key=keyabcde00000004 buf=Value00000001
ReadThreadFunc thread 0 key=keyabcde00000004 buf=Value00000001
ReadThreadFunc thread 1 key=keyabcde00000005 buf=Value00000005
ReadThreadFunc thread 1 key=keyabcde00000005 buf=Value00000005
WriteThreadFunc thread 0 put key=keyabcde00000004 buf=Value00000003    <-- Put older version
WriteThreadFunc thread 1 put key=keyabcde00000006 buf=Value00000003
WriteThreadFunc thread 2 put key=keyabcde00000007 buf=Value00000004
ReadThreadFunc thread 2 key=keyabcde00000006 buf=Value00000002
ReadThreadFunc thread 2 key=keyabcde00000006 buf=Value00000002
ReadThreadFunc thread 0 key=keyabcde00000004 buf=Value00000002
ReadThreadFunc thread 0 key=keyabcde00000004 buf=Value00000002
ReadThreadFunc thread 1 key=keyabcde00000005 buf=Value00000005
ReadThreadFunc thread 1 key=keyabcde00000005 buf=Value00000005
WriteThreadFunc thread 0 put key=keyabcde00000004 buf=Value00000004    <-- Put newer version
WriteThreadFunc thread 1 put key=keyabcde00000006 buf=Value00000004
WriteThreadFunc thread 2 put key=keyabcde00000007 buf=Value00000005
ReadThreadFunc thread 2 key=keyabcde00000006 buf=Value00000003
ReadThreadFunc thread 2 key=keyabcde00000006 buf=Value00000003
ReadThreadFunc thread 0 key=keyabcde00000004 buf=Value00000003    <-- Get older version
ReadThreadFunc thread 0 key=keyabcde00000004 buf=Value00000003
 
                                            ...
                        
WriteThreadFunc thread 2 put key=keyabcde00000003 buf=Value00000005    <-- Put older version (case 1)
Event Write thread 0 started buf_size=4096 start=1 end=5
WriteThreadFunc thread 1 put key=keyabcde00000002 buf=Value00000004
Event Read thread 2 started buf_size=4096 start=3 end=7
WriteThreadFunc thread 2 put key=keyabcde00000004 buf=Value00000001
Event Read thread 1 started buf_size=4096 start=2 end=6
WriteThreadFunc thread 1 put key=keyabcde00000002 buf=Value00000005
WriteThreadFunc thread 1 put key=keyabcde00000003 buf=Value00000001    <-- Put newer version (case 1); Put older version (case 2)
ReadThreadFunc thread 2 key=keyabcde00000003 buf=Value00000005    <-- Get older version (case 1)
ReadThreadFunc thread 2 key=keyabcde00000003 buf=Value00000005
ReadThreadFunc thread 0 key=keyabcde00000001 buf=Value00000001
ReadThreadFunc thread 0 key=keyabcde00000001 buf=Value00000001
ReadThreadFunc thread 1 key=keyabcde00000002 buf=Value00000005
ReadThreadFunc thread 1 key=keyabcde00000002 buf=Value00000005
WriteThreadFunc thread 0 put key=keyabcde00000001 buf=Value00000002
WriteThreadFunc thread 2 put key=keyabcde00000004 buf=Value00000003
WriteThreadFunc thread 1 put key=keyabcde00000003 buf=Value00000002    <-- Put newer version (case 2)
ReadThreadFunc thread 2 key=keyabcde00000003 buf=Value00000001    <-- Get older version (case 2)
 

Thanks 
Ping
 


Yong, Fan
 

Hi Ping,

 

Is “ReadThreadFunc thread 0” the same thread as “WriteThreadFunc thread 0”, or are they two different threads? Is there any concurrency control among the read and write threads, or do all the threads run in parallel without any control?

 

More inline comments.

 

--

Cheers,

Nasf

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of ping.wong via groups.io
Sent: Tuesday, February 9, 2021 1:50 PM
To: daos@daos.groups.io
Subject: [daos] Questions on DTX and KV Puts values visibility

 


1. On the master/leader server, when it sends the RPC reply to the client for daos_kv_put():

     1.1 Has the put been committed before returning to the client?  If not, what is the typical asynchronous commit threshold that makes the put values visible to other clients?

[Nasf] By default, when the leader server replies to the client, the related update (put) is committable on the server but not yet committed to persistent storage. That is the typical asynchronous commit. Even with asynchronous commit, the update (put) is visible to other clients as soon as the leader server has replied to the originating client; visibility does not depend on the actual commit to persistent storage. However, if there is no communication between clients, it is not easy to guarantee that a read on one client happens after a write on another client.


     1.2 Under what condition(s) does the leader commit the transaction?

[Nasf] For asynchronous commit, a dedicated ULT periodically commits the committable DTX entries in batches. It is triggered by either of two conditions: the number of committable DTX entries exceeds a threshold, or some DTX entries have become too old.


     1.3 Does the leader send an RPC to the replica to commit the transaction synchronously, or asynchronously after replying to the client?

[Nasf] The DTX batched commit ULT (not the IO handler) on the leader commits DTX entries asynchronously, after the leader has replied to the related clients.

   

2. On the master server, how long does it wait (timeout) for the replica to reply?

[Nasf] It is the CaRT timeout, 60 seconds by default.

 
    2.1 Does the master server send an RPC to the replica if the leader does not get a reply from the replica (e.g. when the replica dies)?

[Nasf] If the replica (non-leader) is dead, the leader's request times out and the related update (put) fails.


    2.2 Under what condition(s) does the master server commit synchronously?

[Nasf] The application can require the leader to commit the related update (put) synchronously via certain RPC flags. Alternatively, when the leader cannot commit asynchronously, it commits the related DTX synchronously.
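
To make the batched-commit trigger from answer 1.2 concrete, here is a minimal C sketch of the two conditions (entry count over a threshold, or oldest entry too old). It is only an illustration of the rule as described above, not DAOS source: the struct, the function name, and both threshold values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thresholds for illustration only; not DAOS's real values. */
#define SKETCH_COUNT_THRESHOLD 512u
#define SKETCH_AGE_THRESHOLD_S 10u

/* Hypothetical view of the leader's queue of committable DTX entries. */
struct sketch_committable_queue {
    uint32_t count;             /* committable entries waiting for commit */
    uint64_t oldest_enqueue_s;  /* time (s) the oldest entry became committable */
};

/* Returns true when the batched-commit ULT should flush the queue:
 * either too many entries are waiting, or the oldest one is too old. */
static bool sketch_should_batch_commit(const struct sketch_committable_queue *q,
                                       uint64_t now_s)
{
    if (q->count == 0)
        return false;                       /* nothing to commit */
    if (q->count > SKETCH_COUNT_THRESHOLD)
        return true;                        /* condition 1: count exceeds threshold */
    /* condition 2: age check, guarded against clock values going backwards */
    return now_s > q->oldest_enqueue_s &&
           now_s - q->oldest_enqueue_s > SKETCH_AGE_THRESHOLD_S;
}
```

Note that neither condition affects visibility: per answer 1.1, the put is already visible once the leader has replied, so this trigger only decides when the DTX status is made persistent.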



 


ping.wong@...
 

Hi Nasf,

ReadThreadFunc thread 0 and WriteThreadFunc thread 0 are two separate threads.  All threads run in parallel without any concurrency control.

Ping


Yong, Fan
 

Then how do you guarantee that the read request is sent only after the write request has been replied to? If the value read is old, how do you know whether that is expected or not?

In your case, what DAOS should (and can) guarantee is this: if a thread reads a key and gets value ‘a’, and the same thread then reads the same key a second time and gets value ‘b’, then if ‘b’ is not the same as ‘a’, ‘b’ must be newer than ‘a’.
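
One way a test client could tell an expected old value from a genuinely stale one is to record which puts have already completed and compare each read against that record. The C sketch below is a hypothetical client-side checker, not part of the DAOS API: all names are made up, and it assumes a single writer per key whose versions only increase (which is not the case if the values wrap around, as with start=1 end=5 in the log above).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Highest version whose put is known to have completed (one key, one writer). */
static _Atomic uint64_t sketch_acked_version;

/* Writer: call only AFTER the put's completion event has been observed. */
static void sketch_writer_note_completed(uint64_t version)
{
    uint64_t cur = atomic_load(&sketch_acked_version);

    /* Atomic "max": never move the acknowledged version backwards. */
    while (cur < version &&
           !atomic_compare_exchange_weak(&sketch_acked_version, &cur, version))
        ;
}

/* Reader: take a snapshot BEFORE issuing the get. */
static uint64_t sketch_reader_snapshot(void)
{
    return atomic_load(&sketch_acked_version);
}

/* Parse a buffer of the form "Value00000003" into its version number. */
static bool sketch_parse_version(const char *buf, uint64_t *out)
{
    static const char prefix[] = "Value";
    size_t n = sizeof(prefix) - 1;
    char *end;

    if (strncmp(buf, prefix, n) != 0 || buf[n] == '\0')
        return false;
    *out = strtoull(buf + n, &end, 10);
    return *end == '\0';
}

/* A read is stale only if it returned a version older than one that was
 * already acknowledged before the read was issued. Reading a version newer
 * than the snapshot just means a put completed in the meantime. */
static bool sketch_read_is_stale(uint64_t snapshot, uint64_t got)
{
    return got < snapshot;
}
```

Without the snapshot step, a reader that sees an old value cannot distinguish "the put had not been replied to yet" from "the put was replied to but is not visible", which is the ambiguity raised above.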

 

--

Cheers,

Nasf

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of ping.wong via groups.io
Sent: Tuesday, February 9, 2021 3:37 PM
To: daos@daos.groups.io
Subject: Re: [daos] Questions on DTX and KV Puts values visibility

 



ping.wong@...
 

Hi Nasf,

You mentioned that when the leader replies to the client, the related put is committable on the server but not yet committed to persistent storage.
However, I found that the put has already been written to SSD via bio_rw on both the leader and the replica in earlier steps.  Please explain the asynchronous commit that happens after the leader replies to the client.  Are there any RPCs that the leader has to send to the replica to indicate that the DTX is finally committed during the asynchronous commit phase?  How do the replica and the leader server agree that they have both committed the DTX?

Ping
 


Yong, Fan
 

“Committable” refers to the DTX status, which controls the visibility of the related data. Before the DTX becomes committable, the related data is invisible to clients even if it has already been written to persistent storage. The DTX becomes committable only when the leader has executed the related modification locally and received successful replies from all related non-leader replicas. Once the DTX is committable on the leader, for async DTX (the default), the IO handler (ULT) replies to the client immediately. At that time, the DTX status on the non-leader replicas is ‘prepared’. Some time later, another ULT (the async batched commit ULT) on the leader sends a DTX commit RPC to the non-leader replicas, which persistently changes the DTX status on all replicas. My earlier comment about “commit to persistent storage” referred to this step.
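
The sequence described above can be modelled as a small state machine. The C sketch below is a simplified illustration under stated assumptions, not DAOS code: the type and function names are hypothetical, only the three states named in this thread are modelled, and abort handling for failed replicas is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_REPLICAS 4

/* Hypothetical state names mirroring the description above. */
enum sketch_dtx_state {
    SKETCH_DTX_PREPARED,
    SKETCH_DTX_COMMITTABLE,
    SKETCH_DTX_COMMITTED
};

struct sketch_dtx {
    enum sketch_dtx_state leader_state;
    enum sketch_dtx_state replica_state[SKETCH_MAX_REPLICAS];
    size_t nreplicas;   /* number of non-leader replicas */
};

/* IO handler step: the leader applies the update locally and collects the
 * replies from the non-leader replicas. The DTX becomes committable only if
 * every participant succeeded. Returns true when the leader may reply to the
 * client with success. (What happens to a failed DTX is not modelled.) */
static bool sketch_leader_execute(struct sketch_dtx *d, bool leader_ok,
                                  const bool *replica_ok)
{
    bool all_ok = leader_ok;
    size_t i;

    d->leader_state = SKETCH_DTX_PREPARED;
    for (i = 0; i < d->nreplicas; i++) {
        d->replica_state[i] = SKETCH_DTX_PREPARED;
        if (!replica_ok[i])
            all_ok = false;
    }
    if (all_ok)
        d->leader_state = SKETCH_DTX_COMMITTABLE;
    return all_ok;
}

/* Data is visible once the DTX is at least committable on the leader,
 * even though the data itself was written to storage earlier. */
static bool sketch_visible(const struct sketch_dtx *d)
{
    return d->leader_state != SKETCH_DTX_PREPARED;
}

/* Batched-commit ULT step, some time after the client was answered:
 * the commit RPC moves every replica from prepared to committed. */
static void sketch_batched_commit(struct sketch_dtx *d)
{
    size_t i;

    if (d->leader_state != SKETCH_DTX_COMMITTABLE)
        return;
    for (i = 0; i < d->nreplicas; i++)
        d->replica_state[i] = SKETCH_DTX_COMMITTED;
    d->leader_state = SKETCH_DTX_COMMITTED;
}
```

In this model the window between sketch_leader_execute() returning true and sketch_batched_commit() running is exactly the period in which the client has its reply while the replicas are still ‘prepared’.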

 

--

Cheers,

Nasf

 

From: daos@daos.groups.io <daos@daos.groups.io> On Behalf Of ping.wong via groups.io
Sent: Tuesday, February 9, 2021 4:39 PM
To: daos@daos.groups.io
Subject: Re: [daos] Questions on DTX and KV Puts values visibility

 
